WO2022247126A1 - Visual positioning method, device, equipment, medium and program - Google Patents

Visual positioning method, device, equipment, medium and program

Info

Publication number
WO2022247126A1
Authority
WO
WIPO (PCT)
Prior art keywords
landmark
image
sample
point
feature
Prior art date
Application number
PCT/CN2021/126039
Other languages
English (en)
French (fr)
Inventor
章国锋
鲍虎军
黄昭阳
周晗
周晓巍
李鸿升
Original Assignee
浙江商汤科技开发有限公司
Application filed by 浙江商汤科技开发有限公司
Publication of WO2022247126A1

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 - Image analysis
    • G06T 7/0002 - Inspection of images, e.g. flaw detection
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/22 - Matching criteria, e.g. proximity measures
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 17/00 - Three dimensional [3D] modelling, e.g. data description of 3D objects
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 - Image analysis
    • G06T 7/70 - Determining position or orientation of objects or cameras

Definitions

  • the present disclosure relates to the technical field of computer vision, in particular to a visual positioning method, device, equipment, medium and program.
  • In the related art, a scene coordinate regression method or a feature-based visual positioning framework is usually used to construct dense 2D-3D point pairs and recover the camera pose from those dense point pairs.
  • However, these scene coordinates usually contain a large number of outliers, and in dynamic environments (e.g., moving objects, lighting changes) the proportion of outliers increases accordingly, making stable and reliable visual positioning impossible. In view of this, how to improve the accuracy and robustness of visual localization has become an urgent problem to be solved.
  • the disclosure provides a visual positioning method, device, equipment, medium and program.
  • An embodiment of the present disclosure provides a visual positioning method, executed by an electronic device, which includes: acquiring an image to be positioned captured of a preset scene; performing landmark detection on the image to be positioned to obtain a target landmark point in the image to be positioned, wherein the target landmark point is at least one of several landmark points in the preset scene, the several landmark points are selected from a scene map of the preset scene, the scene map is obtained by performing three-dimensional modeling on the preset scene, and the several landmark points are respectively located at preset positions of the sub-areas of the scene map; and obtaining the pose parameters of the image to be positioned based on first position information of the target landmark point in the image to be positioned and second position information of the target landmark point in the scene map.
  • In some embodiments, the several sub-areas are obtained by dividing the surface of the scene map; and/or, the preset position includes the center position of the sub-area; and/or, the area difference between the sub-areas is lower than a first threshold.
  • Since the image to be positioned is usually an imaging of the preset scene surface, obtaining the sub-regions by dividing the surface of the scene map helps improve the accuracy of the target landmark points detected in the image to be positioned.
  • Setting the preset position to include the center position of the sub-region helps distribute the landmark points uniformly and improves the quality of the point pairs; likewise, setting the area difference between sub-regions to be lower than the first threshold helps distribute the landmark points uniformly and improves the quality of the point pairs.
  • In some embodiments, performing landmark detection on the image to be positioned to obtain the target landmark point in the image to be positioned includes: processing the image to be positioned with a landmark detection model to predict a first landmark prediction image and a first direction prediction image; and analyzing the first landmark prediction image and the first direction prediction image to obtain the target landmark point; wherein the first landmark prediction image includes the predicted landmark attributes of the pixels in the image to be positioned, the first direction prediction image includes the first direction attributes of the pixels in the image to be positioned, the predicted landmark attribute is used to identify the landmark point corresponding to a pixel, the first direction attribute includes first direction information pointing to the landmark projection, and the landmark projection indicates the projection position, in the image to be positioned, of the landmark point corresponding to the pixel.
  • In this way, the first landmark prediction image and the first direction prediction image are obtained, where the first landmark prediction image includes the predicted landmark attributes of the pixels in the image to be positioned and the first direction prediction image includes the first direction attributes of those pixels; the predicted landmark attribute identifies the landmark point corresponding to a pixel, the first direction attribute includes the first direction information pointing to the landmark projection, and the landmark projection indicates the projection position of the corresponding landmark point in the image to be positioned.
  • On this basis, the first landmark prediction image and the first direction prediction image are analyzed to obtain the target landmark point. Because the first landmark prediction image includes the landmark point corresponding to each pixel, and the first direction prediction image includes the direction information of each pixel pointing to the landmark projection, the impact of dynamic environments can be greatly reduced and the robustness of positioning improved.
  • In some embodiments, analyzing the first landmark prediction image and the first direction prediction image to obtain the target landmark point includes: acquiring a candidate area composed of pixels with the same predicted landmark attribute; counting the consistency of the first direction attributes of the pixels in the candidate area; and, when the consistency satisfies a preset condition, using the landmark point identified by the predicted landmark attribute of the pixels in the candidate area as the target landmark point, and obtaining the first position information of the target landmark point in the image to be positioned based on the first direction attributes of the pixels in the candidate area.
  • In this way, a candidate area composed of pixels with the same predicted landmark attribute is acquired, and the consistency of the first direction attributes of the pixels in the candidate area is counted; when the consistency satisfies the preset condition, the landmark point identified by the predicted landmark attribute of the pixels in the candidate area is used as the target landmark point, and the first position information of the target landmark point in the image to be positioned is obtained from the first direction attributes of the pixels in the candidate area.
  • That is, before the target landmark point is determined from the predicted landmark attribute of the pixels in the candidate area, the consistency of their first direction attributes is detected first. Ensuring this consistency improves the quality of the subsequently constructed point pairs, which in turn helps improve the accuracy and robustness of visual positioning.
  • In some embodiments, before counting the consistency of the first direction attributes of the pixels in the candidate area, the method further includes: filtering out the candidate area when its area is smaller than a second threshold. Detecting the area of the candidate area first and pre-filtering candidate areas that are too small helps filter out unstable areas in advance, improves the quality of the subsequently constructed point pairs, and in turn helps improve the accuracy and robustness of visual positioning.
  • In some embodiments, the first direction information includes a first direction vector, and counting the consistency of the first direction attributes of the pixels in the candidate area includes: obtaining the intersection points of the first direction vectors between the pixels in the candidate area, and counting the outlier rate of the intersection points to obtain the consistency.
  • The consistency thus effectively reflects the overall prediction quality of the first direction attributes of the pixels in the candidate area, which improves the quality of the subsequently constructed point pairs and in turn helps improve the accuracy and robustness of visual positioning.
  • In some embodiments, the landmark detection model includes a feature extraction network, a landmark prediction network, and a direction prediction network; processing the image to be positioned with the landmark detection model to predict the first landmark prediction image and the first direction prediction image includes: using the feature extraction network to perform feature extraction on the image to be positioned to obtain a feature image; using the landmark prediction network to perform landmark prediction on the feature image to obtain the first landmark prediction image; and using the direction prediction network to perform direction prediction on the feature image to obtain the first direction prediction image.
  • In this way, the feature extraction network extracts features from the image to be positioned to obtain a feature image, the landmark prediction network performs landmark prediction on the feature image to obtain the first landmark prediction image, and the direction prediction network performs direction prediction on the feature image to obtain the first direction prediction image. That is, the landmark prediction network and the direction prediction network are responsible for predicting the landmarks and the directions respectively, and they share the feature image obtained by the feature extraction network, which helps improve prediction efficiency.
  • In some embodiments, using the landmark prediction network to perform landmark prediction on the feature image to obtain the first landmark prediction image includes: using the landmark prediction network to decode the feature image to obtain a first feature prediction image, wherein the first feature prediction image includes the first feature representations of the pixels in the image to be positioned; obtaining the predicted landmark attribute of a pixel based on the similarity between its first feature representation and the landmark feature representation of each landmark point, wherein the landmark feature representations are obtained after the training of the landmark detection model converges; and obtaining the first landmark prediction image based on the predicted landmark attributes of the pixels in the image to be positioned.
  • In this way, a first feature prediction image is obtained that includes the first feature representations of the pixels in the image to be positioned.
  • The predicted landmark attribute of each pixel is obtained from the similarity between its first feature representation and the landmark feature representations, which are obtained after the training of the landmark detection model converges, and the first landmark prediction image is then obtained from the predicted landmark attributes of the pixels in the image to be positioned.
  • Since the landmark feature representations obtained after training convergence can accurately characterize the features of the landmark points, deriving the predicted landmark attribute of a pixel from the similarity between its predicted first feature representation and each landmark feature representation helps improve the accuracy of the predicted landmark attributes.
  • In some embodiments, the target landmark point is detected by using a landmark detection model, and the training step of the landmark detection model includes: respectively determining the projection areas of the sub-regions and the projection positions of the landmark points in a sample image; determining, based on the projection areas and the projection positions, the sample landmark attribute and the sample direction attribute of the sample pixels in the sample image, wherein the sample landmark attribute is used to identify the sample landmark point corresponding to a sample pixel, the sample landmark point is the landmark point contained in the sub-area whose projection area covers the sample pixel, and the sample direction attribute includes sample direction information pointing to the projected position of the sample landmark point corresponding to the sample pixel; obtaining, based on the sample landmark attribute and the sample direction attribute respectively, the sample landmark image and the sample direction image of the sample image, wherein a first pixel in the sample landmark image is marked with the sample landmark attribute of the corresponding sample pixel and a second pixel in the sample direction image is marked with the sample direction attribute of the corresponding sample pixel; and training the landmark detection model with the sample image, the sample landmark image and the sample direction image.
  • In this way, the target landmark point is detected by using the landmark detection model.
  • During training, the sample landmark attribute and sample direction attribute of each sample pixel in the sample image are determined: the sample landmark attribute identifies the sample landmark point corresponding to the sample pixel, the sample landmark point is the landmark point contained in the sub-area whose projection area covers the sample pixel, and the sample direction attribute includes the sample direction information pointing to the projected position of that sample landmark point.
  • On this basis, the sample landmark image and the sample direction image of the sample image are obtained, with the first pixels in the sample landmark image marked with the sample landmark attributes of the corresponding sample pixels and the second pixels in the sample direction image marked with the sample direction attributes of the corresponding sample pixels. Training samples can therefore be constructed accurately, and training the landmark detection model with the sample image, the sample landmark image and the sample direction image helps improve its detection performance.
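  • The label-generation step above can be sketched as follows. This is a minimal illustration, assuming a renderer has already produced a per-pixel sub-region id map for the sample image (region_map, with -1 for background), that landmarks_3d[i] is the landmark point of sub-region i, and that K, R, t are the known camera intrinsics and pose; all of these names are illustrative, not from the patent.

```python
import numpy as np

def make_sample_targets(region_map, landmarks_3d, K, R, t):
    """Build the sample landmark image and sample direction image."""
    H, W = region_map.shape
    # Project every landmark into the sample image (pinhole model).
    cam = landmarks_3d @ R.T + t            # (n, 3) camera-frame coordinates
    proj = cam @ K.T
    proj = proj[:, :2] / proj[:, 2:3]       # (n, 2) projected pixel positions

    sample_landmark = np.full((H, W), -1, np.int32)        # landmark id per pixel
    sample_direction = np.zeros((H, W, 2), np.float32)     # direction per pixel

    ys, xs = np.nonzero(region_map >= 0)
    ids = region_map[ys, xs]
    sample_landmark[ys, xs] = ids
    # Unit vectors from each covered pixel toward its landmark's projection.
    vec = proj[ids] - np.stack([xs, ys], axis=1).astype(np.float32)
    vec /= np.linalg.norm(vec, axis=1, keepdims=True) + 1e-8
    sample_direction[ys, xs] = vec
    return sample_landmark, sample_direction
```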
  • In some embodiments, training the landmark detection model with the sample image, the sample landmark image and the sample direction image includes: using the landmark detection model to predict on the sample image to obtain a second feature prediction image and a second direction prediction image of the sample image, wherein the second feature prediction image includes the second feature representations of the sample pixels, the second direction prediction image includes the second direction attributes of the sample pixels, the second direction attribute includes second direction information pointing to the sample landmark projection, and the sample landmark projection represents the projected position of the sample landmark point in the sample image; obtaining a first loss based on the sample landmark image and the second feature prediction image, and obtaining a second loss from the difference between the sample direction image and the second direction prediction image; and optimizing the network parameters of the landmark detection model based on the first loss and the second loss.
  • In this way, the landmark detection model predicts on the sample image to obtain the second feature prediction image and the second direction prediction image, where the second feature prediction image includes the second feature representations of the sample pixels and the second direction prediction image includes the second direction attributes of the sample pixels.
  • The first loss is obtained based on the sample landmark image and the second feature prediction image.
  • The second loss is obtained from the difference between the sample direction image and the second direction prediction image.
  • The network parameters of the landmark detection model can then be optimized, so that the training of the landmark detection model is supervised by the pre-built sample landmark images and sample direction images, which is conducive to improving its detection performance.
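  • A minimal PyTorch-style training step under this two-loss scheme might look as follows; the contrastive first loss is sketched separately after the negative-sampling description below. Here the second loss is taken, as one plausible choice, to be an L1 difference between the predicted and sample direction vectors on valid pixels; the exact loss forms and all names are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def train_step(model, optimizer, image, sample_landmark, sample_direction,
               landmark_feats, first_loss_fn):
    # Forward pass: second feature prediction image and second direction prediction image.
    feat_pred, dir_pred = model(image)
    valid = sample_landmark >= 0                 # ignore background pixels

    # First loss: supervises the feature head against the sample landmark image.
    loss1 = first_loss_fn(feat_pred, sample_landmark, landmark_feats)
    # Second loss: difference between sample and predicted direction images.
    loss2 = F.l1_loss(dir_pred.permute(0, 2, 3, 1)[valid],
                      sample_direction[valid])

    optimizer.zero_grad()
    (loss1 + loss2).backward()
    optimizer.step()   # updates the network parameters (and, if registered in
                       # the optimizer, the landmark feature representations too)
    return loss1.item(), loss2.item()
```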
  • In some embodiments, obtaining the first loss based on the sample landmark image and the second feature prediction image includes: acquiring the image areas formed by sample pixels with the same sample landmark attribute, and acquiring the feature representation to be optimized of each landmark point; for a sample pixel in an image area, using the feature representation to be optimized of the sample landmark point identified by its sample landmark attribute as the positive example feature representation of the sample pixel, selecting a reference feature representation as the negative example feature representation of the sample pixel, and obtaining a sub-loss based on the first similarity between the second feature representation and the positive example feature representation and the second similarity between the second feature representation and the negative example feature representation, wherein the reference feature representations include the feature representations to be optimized other than the positive example feature representation; and obtaining the first loss based on the sub-losses of the sample pixels in the sample image.
  • In this way, the image areas composed of sample pixels with the same sample landmark attribute are obtained, together with the feature representation to be optimized of each landmark point. For each sample pixel in an image area, the feature representation to be optimized of the sample landmark point identified by the sample landmark attribute serves as its positive example feature representation, a reference feature representation is selected as its negative example feature representation (the reference feature representations being the feature representations to be optimized other than the positive one), and a sub-loss is obtained from the first similarity between the second feature representation and the positive example feature representation and the second similarity between the second feature representation and the negative example feature representation; the first loss is then obtained from the sub-losses of the sample pixels in the sample image. On the one hand, minimizing the first loss pulls the second feature representation as close as possible to its positive example feature representation and pushes it as far as possible from its negative example feature representation, improving the prediction performance of the landmark prediction network; on the other hand, selecting a single reference feature representation as the negative example feature representation reduces the complexity of negative example selection.
  • In some embodiments, selecting a reference feature representation as the negative example feature representation of a sample pixel includes: counting the average feature representation of the second feature representations of the sample pixels in the image area; selecting, based on the similarity between the average feature representation and each reference feature representation, several reference feature representations as candidate feature representations of the image area; and sampling uniformly among the candidate feature representations to obtain the negative example feature representation of the sample pixel.
  • Since the average feature representation can represent the overall feature representation of the image area, screening the candidate feature representations through the average feature representation and then sampling uniformly among them, on the one hand, helps improve the reference value of the selected negative examples and, on the other hand, helps reduce the complexity of selecting a negative example feature representation for each sample pixel in the image area.
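  • The negative-example selection and per-pixel sub-loss could be sketched as below, assuming feats holds the second feature representations of the sample pixels in one image area and to_optimize is the (n, C) table of feature representations to be optimized, one row per landmark point. The logistic ranking form of the sub-loss, the candidate count k, and the temperature tau are assumptions, not taken from the patent.

```python
import torch
import torch.nn.functional as F

def region_sub_loss(feats, positive_id, to_optimize, k=64, tau=0.1):
    """feats: (m, C) second feature representations of one image area."""
    pos = to_optimize[positive_id]                       # positive example feature
    ref = torch.cat([to_optimize[:positive_id],          # reference features:
                     to_optimize[positive_id + 1:]])     # all but the positive one
    # Rank references by similarity to the area's average feature representation...
    avg = feats.mean(dim=0)
    sims = F.cosine_similarity(avg.unsqueeze(0), ref, dim=1)
    cand = ref[sims.topk(min(k, ref.shape[0])).indices]  # candidate features
    # ...then sample negatives uniformly among the candidates, one per pixel.
    neg = cand[torch.randint(cand.shape[0], (feats.shape[0],))]

    s_pos = (feats * pos).sum(dim=1) / tau               # first similarity
    s_neg = (feats * neg).sum(dim=1) / tau               # second similarity
    # Encourage each pixel feature to be closer to the positive than the negative.
    return -torch.log(torch.sigmoid(s_pos - s_neg)).mean()
```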
  • In some embodiments, optimizing the network parameters of the landmark detection model based on the first loss and the second loss includes: optimizing, based on the first loss and the second loss, the feature representation to be optimized of each landmark point together with the network parameters of the landmark detection model. The network parameters and the feature representations to be optimized of the landmark points can thus be optimized simultaneously during training, which is conducive to improving the accuracy and robustness of landmark detection, helps improve the quality of the point pairs, and in turn helps improve the accuracy and robustness of visual positioning.
  • An embodiment of the present disclosure provides a visual positioning device, including: an information acquisition module, a landmark detection module, and a pose determination module.
  • the landmark detection module is configured to perform landmark detection on the image to be positioned to obtain the target landmark point in the image to be positioned; wherein the target landmark point is at least one of several landmark points in the preset scene, the several landmark points are selected from the scene map of the preset scene, the scene map is obtained by three-dimensional modeling of the preset scene, and the several landmark points are respectively located at the preset positions of the sub-areas of the scene map;
  • the pose determination module is configured to obtain the pose parameters of the image to be positioned based on the first position information of the target landmark point in the image to be positioned and the second position information of the target landmark point in the scene map.
  • In some embodiments, the several sub-areas are obtained by dividing the surface of the scene map; and/or, the preset position includes the center position of the sub-area; and/or, the area difference between the sub-areas is lower than the first threshold.
  • In some embodiments, the landmark detection module includes: an image processing submodule configured to process the image to be positioned with the landmark detection model to predict the first landmark prediction image and the first direction prediction image; and an image analysis submodule configured to analyze the first landmark prediction image and the first direction prediction image to obtain the target landmark point; wherein the first landmark prediction image includes the predicted landmark attributes of the pixels in the image to be positioned, the first direction prediction image includes the first direction attributes of the pixels in the image to be positioned, the predicted landmark attribute is used to identify the landmark point corresponding to a pixel, the first direction attribute includes the first direction information pointing to the landmark projection, and the landmark projection indicates the projection position of the corresponding landmark point in the image to be positioned.
  • In some embodiments, the image analysis submodule includes: a candidate area acquisition unit configured to acquire a candidate area formed by pixels with the same predicted landmark attribute; a consistency statistics unit configured to count the consistency of the first direction attributes of the pixels in the candidate area; and a landmark determination unit configured to, when the consistency satisfies the preset condition, use the landmark point identified by the predicted landmark attribute of the pixels in the candidate area as the target landmark point, and obtain the first position information of the target landmark point in the image to be positioned based on the first direction attributes of the pixels in the candidate area.
  • the image analysis submodule further includes: a candidate region filtering unit configured to filter the candidate region when the region area of the candidate region is smaller than a second threshold.
  • In some embodiments, the first direction information includes the first direction vector; the consistency statistics unit is further configured to obtain the intersection points of the first direction vectors between the pixels in the candidate area, and to count the outlier rate of the intersection points to obtain the consistency.
  • the landmark detection model includes a feature extraction network, a landmark prediction network, and a direction prediction network
  • In some embodiments, the image processing submodule includes: a feature extraction unit configured to use the feature extraction network to perform feature extraction on the image to be positioned to obtain the feature image; a landmark prediction unit configured to use the landmark prediction network to perform landmark prediction on the feature image to obtain the first landmark prediction image; and a direction prediction unit configured to use the direction prediction network to perform direction prediction on the feature image to obtain the first direction prediction image.
  • In some embodiments, the landmark prediction unit is further configured to use the landmark prediction network to decode the feature image to obtain the first feature prediction image, wherein the first feature prediction image includes the first feature representations of the pixels in the image to be positioned; to obtain the predicted landmark attribute of a pixel based on the similarity between its first feature representation and the landmark feature representation of each landmark point, wherein the landmark feature representations are obtained after the training of the landmark detection model converges; and to obtain the first landmark prediction image based on the predicted landmark attributes of the pixels in the image to be positioned.
  • In some embodiments, the target landmark point is detected by using a landmark detection model, and the visual positioning device further includes: a projection acquisition module configured to respectively determine the projection areas of the sub-regions and the projection positions of the landmark points in the sample image;
  • the attribute determination module is configured to determine the sample landmark attribute and the sample direction attribute of the sample pixels in the sample image based on the projection area and the projection position; wherein the sample landmark attribute is used to identify the sample landmark point corresponding to the sample pixel, the sample landmark point is the landmark point contained in the sub-area whose projection area covers the sample pixel, and the sample direction attribute includes sample direction information pointing to the projection position of the sample landmark point corresponding to the sample pixel;
  • the sample acquisition module is configured to obtain the sample landmark image and the sample direction image of the sample image based on the sample landmark attribute and the sample direction attribute respectively; wherein the first pixels in the sample landmark image are marked with the sample landmark attributes of the corresponding sample pixels, and the second pixels in the sample direction image are marked with the sample direction attributes of the corresponding sample pixels.
  • In some embodiments, the model training module includes: an image prediction submodule configured to use the landmark detection model to predict on the sample image to obtain the second feature prediction image and the second direction prediction image of the sample image, wherein the second feature prediction image includes the second feature representations of the sample pixels, the second direction prediction image includes the second direction attributes of the sample pixels, the second direction attribute includes the second direction information pointing to the sample landmark projection, and the sample landmark projection represents the projected position of the sample landmark point in the sample image; a loss calculation submodule configured to obtain the first loss based on the sample landmark image and the second feature prediction image, and to obtain the second loss from the difference between the sample direction image and the second direction prediction image; and a parameter optimization submodule configured to optimize the network parameters of the landmark detection model based on the first loss and the second loss.
  • In some embodiments, the loss calculation submodule includes: an image area and feature representation acquisition unit configured to acquire the image areas composed of sample pixels with the same sample landmark attribute and to acquire the feature representation to be optimized of each landmark point; a sub-loss calculation unit configured to, for a sample pixel in an image area, use the feature representation to be optimized of the sample landmark point identified by the sample landmark attribute as the positive example feature representation of the sample pixel, select a reference feature representation as the negative example feature representation of the sample pixel, and obtain a sub-loss based on the first similarity between the second feature representation and the positive example feature representation and the second similarity between the second feature representation and the negative example feature representation, wherein the reference feature representations include the feature representations to be optimized other than the positive example feature representation; and a loss statistics unit configured to obtain the first loss based on the sub-losses of the sample pixels in the sample image.
  • the sub-loss calculation unit is further configured to count the average feature representation of the second feature representation of sample pixels in the image region; based on the similarity between the average feature representation and each reference feature representation, Select several reference feature representations as the candidate feature representations of the image region; sample uniformly in the candidate feature representations to obtain the negative example feature representations of the sample pixels.
  • the parameter optimization submodule is further configured to optimize the feature representation to be optimized for each landmark point and the network parameters of the landmark detection model based on the first loss and the second loss.
  • An embodiment of the present disclosure provides an electronic device, including a memory and a processor coupled to each other, and the processor is configured to execute program instructions stored in the memory, so as to implement the above visual positioning method.
  • An embodiment of the present disclosure provides a computer-readable storage medium, on which program instructions are stored, and when the program instructions are executed by a processor, the above visual positioning method is implemented.
  • An embodiment of the present disclosure also provides a computer program, the computer program includes computer readable codes, and when the computer readable codes run in an electronic device, the processor of the electronic device executes the above-mentioned visual positioning method .
  • the visual positioning method, device, equipment, medium, and program provided by the embodiments of the present disclosure acquire the image to be positioned captured of the preset scene and perform landmark detection on it to obtain the target landmark point in the image to be positioned, the target landmark point being at least one of several landmark points in the preset scene.
  • the several landmark points are selected from the scene map of the preset scene.
  • the scene map is obtained by performing three-dimensional modeling on the preset scene.
  • the several landmark points are respectively located at the preset positions of the sub-areas of the scene map.
  • based on the first position information of the target landmark point in the image to be positioned and the second position information of the target landmark point in the scene map, the pose parameters of the image to be positioned are obtained.
  • because the landmark points are located at preset positions of the sub-areas, they are not scattered in a disorderly manner but have the characteristic of uniform distribution.
  • the target landmark point detected in the image to be positioned is at least one of the several landmark points, and the subsequent visual positioning process depends only on the point pairs formed by the two-dimensional position of the target landmark point in the image to be positioned and its three-dimensional position in the scene map, no longer depending on other point pairs unrelated to the landmark points. The quality of the point pairs can therefore be improved while their number is reduced, which in turn helps improve the accuracy and robustness of visual positioning.
  • FIG. 1 is a schematic flowchart of an embodiment of the visual positioning method of the present disclosure;
  • FIG. 2 is a schematic diagram of an embodiment of a scene map;
  • FIG. 3 is a schematic diagram of an embodiment of detecting a target landmark point using a landmark detection model;
  • FIG. 4 is a schematic diagram of an embodiment of locating a target landmark point;
  • FIG. 5 is a schematic diagram of a system architecture applying the visual positioning method according to an embodiment of the present disclosure;
  • FIG. 6 is a schematic flowchart of an embodiment of step S12 in FIG. 1;
  • FIG. 7 is a schematic diagram of an embodiment of using SIFT features for visual positioning;
  • FIG. 8 is a schematic diagram of an embodiment of using landmark points for visual positioning;
  • FIG. 9 is a schematic diagram of an embodiment of a first landmark prediction image;
  • FIG. 10 is a schematic diagram of an embodiment of a first direction prediction image;
  • FIG. 11 is a schematic flowchart of an embodiment of training a landmark detection model;
  • FIG. 12 is a schematic diagram of an embodiment of calculating the first loss;
  • FIG. 13 is a schematic frame diagram of an embodiment of the visual positioning device of the present disclosure;
  • FIG. 14 is a schematic frame diagram of an embodiment of an electronic device of the present disclosure;
  • FIG. 15 is a block diagram of an embodiment of a computer-readable storage medium of the present disclosure.
  • The terms "system" and "network" are often used interchangeably herein.
  • the term "and/or" herein merely describes an association relationship between associated objects, indicating that three relationships may exist; for example, "A and/or B" may mean: A exists alone, A and B exist simultaneously, or B exists alone.
  • the character "/” in this article generally indicates that the contextual objects are an “or” relationship.
  • “many” herein means two or more than two.
  • FIG. 1 is a schematic flowchart of an embodiment of a visual positioning method of the present disclosure. May include the following steps:
  • Step S11: Obtain the image to be positioned captured of the preset scene.
  • In some embodiments, the preset scene may be set according to actual application requirements. For example, when visual positioning needs to be realized in a scenic spot, the preset scene can include the scenic spot; when visual positioning needs to be realized in a commercial street, the preset scene can include the commercial street; or, when visual positioning needs to be realized in an industrial park, the preset scene can include the industrial park. Other situations can be deduced by analogy, and no more examples are given here.
  • the image to be positioned may be obtained by shooting a preset scene from any angle of view.
  • the image to be positioned may be obtained by shooting the preset scene upward; or the image to be positioned may be obtained by shooting the preset scene from above; or the image to be positioned may be obtained by shooting the preset scene horizontally.
  • In some embodiments, the angle between the optical axis of the camera and the horizontal plane should be lower than a preset angle threshold when shooting the preset scene; that is, the image to be positioned should contain as much of the preset scene as possible and include invalid areas such as the ground and the sky as little as possible.
  • Step S12 Perform landmark detection on the image to be positioned to obtain target landmark points in the image to be positioned.
  • the target landmark point is at least one of several landmark points in the preset scene
  • the several landmark points are selected from the scene map of the preset scene
  • the scene map is obtained by performing three-dimensional modeling on the preset scene, and the several landmark points are respectively located at the preset positions of the sub-areas of the scene map.
  • In some embodiments, a video of the preset scene may be captured in advance, and the video may be processed using a three-dimensional reconstruction algorithm to obtain the scene map of the preset scene.
  • the 3D reconstruction algorithm may include, but is not limited to, Multi-View Stereo (MVS), KinectFusion, etc., which is not limited here.
  • For the implementation of a 3D reconstruction algorithm, refer to the technical details of that algorithm.
  • In some embodiments, the surface of the scene map may be divided into several sub-regions by using a three-dimensional over-segmentation algorithm (e.g., supervoxel segmentation).
  • FIG. 2 is a schematic diagram of an embodiment of a scene map. As shown in Figure 2, different grayscale regions represent different subregions of the scene map surface.
  • the preset position may include a center position of the sub-region.
  • the black dots in the sub-area represent the landmark points determined in this sub-area.
  • In some embodiments, the area difference between the sub-regions may be lower than the first threshold, and the first threshold may be set according to the actual situation, such as 10 pixels, 15 pixels, 20 pixels, etc., which is not limited here. That is, the sub-regions have similar sizes.
  • In this case, the landmark points are evenly distributed on the surface of the scene map, so that no matter from what angle of view the preset scene is shot, the image to be positioned contains enough landmark points, which helps improve the robustness of visual positioning.
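  • As a minimal sketch of this landmark selection step: assuming the mesh surface has already been over-segmented so that every vertex carries a sub-region label (the names vertices and labels are illustrative), one landmark per sub-region can be placed at the vertex nearest the sub-region centroid, approximating its center position.

```python
import numpy as np

def select_landmarks(vertices: np.ndarray, labels: np.ndarray) -> np.ndarray:
    """vertices: (V, 3) surface points; labels: (V,) sub-region id per vertex.
    Returns one 3D landmark point per sub-region."""
    landmarks = []
    for region_id in np.unique(labels):
        region = vertices[labels == region_id]
        centroid = region.mean(axis=0)
        # Snap the centroid back onto the surface: pick the nearest region vertex.
        nearest = region[np.argmin(np.linalg.norm(region - centroid, axis=1))]
        landmarks.append(nearest)
    return np.stack(landmarks)   # (n, 3): the landmark set {q_1, ..., q_n}
```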
  • In some embodiments, in order to improve the efficiency and accuracy of landmark detection, a landmark detection model can be pre-trained, so that the landmark detection model can be used to detect and analyze the image to be positioned to obtain the target landmark point in the image to be positioned.
  • the several landmark points of the preset scene can be denoted as {q_1, q_2, ..., q_n}, and there can be at least one target landmark point.
  • In some embodiments, the first landmark prediction image and the first direction prediction image can be obtained after the landmark detection model processes the image to be positioned.
  • the first landmark prediction image includes the predicted landmark attributes of the pixels in the image to be positioned.
  • the first direction prediction image includes the first direction attributes of the pixels in the image to be positioned.
  • the predicted landmark attribute is used to identify the landmark point corresponding to a pixel.
  • the first direction attribute includes the first direction information pointing to the landmark projection.
  • the landmark projection indicates the projection position of the landmark point corresponding to the pixel in the image to be positioned. On this basis, the first landmark prediction image and the first direction prediction image are analyzed to obtain the target landmark point.
  • the first landmark prediction image includes the landmark points corresponding to each pixel
  • the first direction prediction image includes the direction information of each pixel pointing to the landmark projection
  • the landmark detection model can include a feature extraction network, a landmark prediction network, and a direction prediction network. The feature extraction network can then be used to perform feature extraction on the image to be positioned to obtain a feature image, the landmark prediction network can be used to perform landmark prediction on the feature image to obtain the first landmark prediction image, and the direction prediction network can be used to perform direction prediction on the feature image to obtain the first direction prediction image. That is, the landmark prediction network and the direction prediction network are responsible for predicting the landmarks and the directions respectively, and they share the feature image extracted by the feature extraction network, which helps improve prediction efficiency.
  • In the example shown in FIG. 3, pixels with the same predicted landmark attribute are displayed in the same grayscale; that is, in the first landmark prediction image shown in FIG. 3, pixels displayed in the same grayscale correspond to the same landmark point (for example, a certain landmark point among the aforementioned {q_1, q_2, ..., q_n}).
  • In some embodiments, different grayscales may be used to represent the direction prediction attributes of the pixels. As shown in the example in FIG. 3, the 0-degree, 45-degree, 90-degree, 135-degree, 180-degree, 225-degree, 270-degree, and 315-degree directions are represented by different grayscales.
  • the first landmark prediction image and the first direction prediction image shown in FIG. 3 are only one possible form of expression in the actual application process; representing the predicted landmark attributes and predicted direction attributes with different grayscales makes the predictions of the landmark detection model visualizable.
  • the output results of the landmark prediction network and the direction prediction network can also be directly represented by numbers, which is not limited here.
  • FIG. 4 is a schematic diagram of an embodiment of locating target landmarks.
  • the hollow circles in the figure represent the target landmark points located in the image to be positioned.
  • the rectangular frame area in the lower right corner is an enlarged schematic diagram of the rectangular frame area in the upper left corner. As shown in the lower-right rectangular frame area in FIG. 4, pixels with the same grayscale share the same predicted landmark attribute, and the direction arrows represent the predicted direction attributes of the pixels.
  • In this way, the target landmark point identified by the predicted landmark attribute (such as a certain landmark point in {q_1, q_2, ..., q_n}) can be determined, and the position information of the target landmark point in the image to be positioned (for example, the position shown by the solid circle in the figure) can be determined based on the predicted direction attributes of the pixels sharing that predicted landmark attribute.
  • In some embodiments, the position information of the target landmark point in the image to be positioned may be determined by finding the intersection point of the direction arrows shown in FIG. 4.
  • In some embodiments, both the first landmark prediction image and the first direction prediction image may have the same size as the image to be positioned; alternatively, at least one of them may differ in size from the image to be positioned.
  • DeepLabV3 can be used as the backbone network of the landmark detection model, which can significantly expand the receptive field through spatial pyramid pooling.
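  • An illustrative PyTorch sketch of such a model is given below: a shared feature extraction backbone (a stand-in for the DeepLabV3-style backbone mentioned above) feeding a landmark prediction head, which outputs per-pixel feature representations, and a direction prediction head, which outputs per-pixel unit direction vectors. The layer sizes and embed_dim are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LandmarkDetector(nn.Module):
    def __init__(self, embed_dim: int = 128):
        super().__init__()
        # Shared feature extraction network (placeholder for a DeepLabV3 backbone).
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 256, 3, padding=1), nn.ReLU(),
        )
        # Landmark prediction head: per-pixel feature representations, later
        # matched against the landmark feature representations by inner product.
        self.landmark_head = nn.Conv2d(256, embed_dim, 1)
        # Direction prediction head: per-pixel 2D vector toward the landmark projection.
        self.direction_head = nn.Conv2d(256, 2, 1)

    def forward(self, x):
        feat = self.backbone(x)                                   # shared feature image
        feat_pred = self.landmark_head(feat)                      # first feature prediction image
        dir_pred = F.normalize(self.direction_head(feat), dim=1)  # unit direction vectors
        return feat_pred, dir_pred
```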
  • Step S13 Based on the first position information of the target landmark point in the image to be positioned and the second position information of the target landmark point in the scene map, obtain the pose parameters of the image to be positioned.
  • the first position information of the target landmark point in the image to be positioned may be two-dimensional coordinates
  • the second position information of the target landmark point in the scene map may be three-dimensional coordinates.
  • the landmark points are selected from the scene map of the preset scene, and the scene map is obtained by performing three-dimensional modeling on the preset scene, so the second position information of the landmark point in the scene map can be directly Determined based on the scene map.
  • In some embodiments, the landmark point corresponding to the target landmark point can be determined among the several landmark points, and the second position information of that corresponding landmark point can be used as the second position information of the target landmark point.
  • On this basis, the first position information of the target landmark point in the image to be positioned and the second position information of the target landmark point in the scene map can be used to solve for the pose parameters (e.g., 6-DOF parameters) of the image to be positioned.
  • For example, the PnP algorithm based on Random Sample Consensus (RANSAC PnP) can be used to obtain the pose parameters.
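  • As a short sketch of this step with OpenCV's RANSAC-based PnP solver: pts_2d holds the first position information of the target landmark points, pts_3d their second position information in the scene map, and K the camera intrinsics, which are assumed known here; the reprojection error threshold is likewise an assumption.

```python
import cv2
import numpy as np

def estimate_pose(pts_2d: np.ndarray, pts_3d: np.ndarray, K: np.ndarray):
    """pts_2d: (m, 2) image coordinates; pts_3d: (m, 3) scene map coordinates."""
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        pts_3d.astype(np.float64), pts_2d.astype(np.float64),
        K, distCoeffs=None, reprojectionError=4.0)
    if not ok:
        return None
    R, _ = cv2.Rodrigues(rvec)   # rotation matrix; with tvec, the 6-DOF pose
    return R, tvec
```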
  • the target landmark point in the image to be positioned is obtained, and the target landmark point is at least one of several landmark points in the preset scene
  • the several landmark points are selected from the scene map of the preset scene, and the scene map is obtained by performing three-dimensional modeling on the preset scene, and the several landmark points are respectively located at preset positions of each sub-area of the scene map.
  • the pose parameters of the image to be positioned are obtained.
  • because the landmark points are located at preset positions of the sub-areas, they are not scattered in a disorderly manner but have the characteristic of uniform distribution.
  • the target landmark point detected in the image to be positioned is at least one of the several landmark points, and the subsequent visual positioning process depends only on the point pairs formed by the two-dimensional position of the target landmark point in the image to be positioned and its three-dimensional position in the scene map, no longer depending on other point pairs unrelated to the landmark points. The quality of the point pairs can therefore be improved while their number is reduced, which in turn helps improve the accuracy and robustness of visual positioning.
  • FIG. 5 shows a schematic diagram of a system architecture to which the visual positioning method of the embodiment of the present disclosure can be applied; as shown in FIG. 5 , the system architecture includes: an image acquisition terminal 501 , a network 502 and a pose parameter determination terminal 503 .
  • the image acquisition terminal 501 and the pose parameter determination terminal 503 establish a communication connection through the network 502; the image acquisition terminal 501 reports the image to be positioned to the pose parameter determination terminal 503 through the network 502; the pose parameter determination terminal 503 performs landmark detection on the image to be positioned to obtain the target landmark point in the image to be positioned, and obtains the pose parameters of the image to be positioned based on the first position information of the target landmark point in the image to be positioned and the second position information of the target landmark point in the scene map.
  • the terminal 503 for determining the pose parameters uploads the pose parameters of the image to be positioned to the network 502 , and sends them to the image acquisition terminal 501 through the network 502 .
  • the image acquisition terminal 501 may include an image acquisition device, and the pose parameter determination terminal 503 may include a vision processing device or a remote server capable of processing visual information.
  • the network 502 may be connected in a wired or wireless manner.
  • when the pose parameter determination terminal 503 is a vision processing device, the image acquisition terminal 501 can communicate with it through a wired connection, such as data communication through a bus;
  • when the pose parameter determination terminal 503 is a remote server, the image acquisition terminal 501 can exchange data with the remote server through a wireless network.
  • the image acquisition terminal 501 may be a vision processing device with a video acquisition module, or a host with a camera.
  • the visual positioning method of the embodiment of the present disclosure may be executed by the image acquisition terminal 501 , and the above-mentioned system architecture may not include the network 502 and the pose parameter determination terminal 503 .
  • FIG. 6 is a schematic flowchart of an embodiment of step S12 in FIG. 1 . As shown in Figure 6, the following steps may be included:
  • Step S61 Using the landmark detection model to process the image to be positioned, and predicting to obtain a first landmark prediction image and a first direction prediction image.
  • the first landmark prediction image includes predicted landmark attributes of pixels in the image to be positioned
  • the first direction prediction image includes first direction attributes of pixels in the image to be positioned
  • the predicted landmark attribute is used to identify the landmark point corresponding to the pixel
  • the first direction attribute includes first direction information pointing to the landmark projection
  • the landmark projection indicates the projected position of the landmark point corresponding to the pixel point in the image to be positioned.
  • both the first landmark prediction image and the first direction prediction image may have the same size as the image to be positioned, or at least one of the first landmark prediction image and the first direction prediction image may be different in size from the image to be positioned.
  • In some embodiments, the predicted landmark attribute can include the label of the landmark point corresponding to the pixel; that is, when the predicted landmark attribute is i, the landmark point corresponding to the pixel is q_i.
  • the first direction information may include a first direction vector, where the first direction vector points to the landmark projection.
  • Ideally, the first direction vector predicted by the landmark detection model accurately points to the landmark projection.
  • In practice, however, the detection performance of the landmark detection model may not be perfect due to various factors, so the predicted first direction vector may not point exactly to the landmark projection; for example, there may be a certain angular deviation (such as 1 degree, 2 degrees, 3 degrees, etc.) between the position pointed to by the first direction vector and the landmark projection. Since a first direction vector can be predicted for every pixel in the image to be positioned, the possible directional deviation of a single first direction vector can be corrected through the first direction vectors of multiple pixels; the process can refer to the following related description.
  • In some embodiments, the landmark detection model may include a feature extraction network, a landmark prediction network, and a direction prediction network; the feature extraction network may then be used to perform feature extraction on the image to be positioned to obtain a feature image, the landmark prediction network may be used to perform landmark prediction on the feature image to obtain the first landmark prediction image, and the direction prediction network may be used to perform direction prediction on the feature image to obtain the first direction prediction image. That is to say, the landmark prediction network and the direction prediction network can share the feature image extracted by the feature extraction network; reference can be made to the relevant description of the aforementioned disclosed embodiments.
  • the first direction information may include a first direction vector, and the first direction vector may be a unit vector with a modulus value of 1.
  • the feature image can be decoded by using the landmark prediction network to obtain a first feature prediction image, and the first feature prediction image includes a first feature representation of pixels in the image to be located.
  • In some embodiments, the predicted landmark attribute of a pixel can be obtained based on the similarity between the first feature representation of the pixel and the landmark feature representation of each landmark point, where the landmark feature representations are obtained after the training of the landmark detection model converges; the first landmark prediction image is then obtained based on the predicted landmark attributes of the pixels in the image to be positioned.
  • In some embodiments, a landmark feature representation set P can be maintained and updated during training; the set P includes the feature representation to be optimized of each landmark point (for example, the aforementioned {q_1, q_2, ..., q_n}). After the training of the landmark detection model converges, the feature information of each landmark point in the preset scene has been learned, and this information is reflected in each landmark point's converged feature representation to be optimized.
  • The feature representations to be optimized at training convergence can be called the landmark feature representations.
  • In some embodiments, the similarity between the first feature representation of a pixel and the landmark feature representations of the landmark points (for example, the aforementioned {q_1, q_2, ..., q_n}) can be calculated, and the landmark point with the highest similarity can be selected as the landmark point corresponding to the pixel, so that this landmark point can be used to identify the pixel and the predicted landmark attribute of the pixel can be obtained.
  • In some embodiments, the inner product between the first feature representation of the pixel and the landmark feature representation of each landmark point can be calculated as the similarity, and the label (such as 1, 2, ..., n) of the best-matching landmark point can be used to identify it, obtaining the predicted landmark attribute.
  • the first predicted landmark image can be obtained.
If the similarity between a pixel's first feature representation and the landmark feature representation of every landmark point is low (for example, all similarities are below a similarity threshold), the pixel can be considered an invalid pixel irrelevant to the preset scene (such as the sky or the ground); in this case, a special mark (such as 0) can be used for identification.
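To make this nearest-landmark assignment concrete, the following is a minimal NumPy sketch of the labeling (the array shapes, function name, and similarity threshold are illustrative assumptions, not taken from the patent). With L2-normalized features, the inner product coincides with the cosine similarity used elsewhere in this disclosure.

```python
import numpy as np

def predict_landmark_attributes(feat: np.ndarray, landmark_feats: np.ndarray,
                                sim_threshold: float = 0.5) -> np.ndarray:
    """feat: (H, W, C) first feature representations, assumed L2-normalized.
    landmark_feats: (n, C) landmark feature representations for {q_1..q_n}.
    Returns an (H, W) map of labels in 1..n, or the special mark 0 for invalid pixels."""
    H, W, C = feat.shape
    sims = feat.reshape(-1, C) @ landmark_feats.T   # (H*W, n) inner products
    best = sims.argmax(axis=1)                      # most similar landmark per pixel
    labels = best + 1                               # landmark labels start at 1
    labels[sims.max(axis=1) < sim_threshold] = 0    # sky/ground etc. -> special mark 0
    return labels.reshape(H, W)
```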
Step S62: Analyze the first landmark prediction image and the first direction prediction image to obtain the target landmark points.
First, the candidate areas formed by pixels with the same predicted landmark attribute can be acquired; that is, through the predicted landmark attributes of the pixels, an image area formed by pixels corresponding to the same landmark point is taken as one candidate area. On this basis, the consistency of the first direction attributes of the pixels in each candidate area can be counted; that is, for each candidate area, the consistency of the first direction attributes of the pixels it contains is evaluated, so that the consistency of every candidate area is obtained. When the consistency satisfies a preset condition, the landmark point identified by the predicted landmark attributes of the pixels in the candidate area is taken as a target landmark point, and the first position information of the target landmark point in the image to be positioned is obtained based on the first direction attributes of the pixels in the candidate area.

In this manner, before the target landmark point is determined from the predicted landmark attributes of the pixels in a candidate area, the consistency of their first direction attributes is checked first, which helps ensure the consistency of the first direction attributes within the candidate area, improves the quality of the subsequently constructed point pairs, and in turn helps improve the accuracy and robustness of visual positioning.
In some embodiments, to further improve the accuracy and robustness of visual positioning, before counting the consistency of the first direction attributes of the pixels in a candidate area, it is also possible to first check whether the area of the candidate area is smaller than a second threshold; if it is, the candidate area can be filtered out. This helps pre-filter unstable areas (such as grass, trees, and other regions whose appearance easily changes with natural conditions), which improves the quality of the subsequently constructed point pairs and in turn helps improve the accuracy and robustness of visual positioning.
As mentioned above, the first direction information may include a first direction vector. For each candidate area, the intersection points of the first direction vectors of the pixels in the candidate area can be obtained first, and the outlier rate of these intersection points is then counted to obtain the consistency of the candidate area. The preset condition can correspondingly be set as the outlier rate being lower than an outlier-rate threshold. That is, since the first direction vectors predicted by the landmark detection model may have directional deviations, the first direction vectors of the pixels in a candidate area may not intersect exactly at one point (i.e., the landmark projection). An outlier-rate threshold can therefore be set in advance, and a RANSAC algorithm based on a line intersection model (i.e., RANSAC with a vote intersection model; refer to its technical details) is used to calculate the outlier rate. If the outlier rate of a candidate area is below the threshold, the directions predicted by the landmark detection model for this candidate area can be considered consistent; otherwise, the model's learning of this candidate area can be considered poor, or the candidate area itself can be considered noisy, and to avoid affecting the accuracy and robustness of subsequent visual positioning, the candidate area can be filtered out directly.
Taking a candidate area corresponding to landmark point j as an example, the initial position information of landmark point j in the image to be positioned can be calculated by the aforementioned RANSAC algorithm based on the line intersection model, and this initial position information can then be optimized with an iterative algorithm similar to Expectation-Maximization (EM) to obtain the first position information of landmark point j in the image to be positioned; for the optimization process, refer to the technical details of the EM iterative algorithm. In some embodiments, during the iterative optimization, if the consistency of a candidate area is poor, the candidate area may be discarded directly.
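The intersection-and-vote consistency check described above can be sketched as follows; this is a minimal RANSAC-style implementation of the line intersection model (the iteration count and inlier angle threshold are illustrative assumptions, not values from the patent):

```python
import numpy as np

def ransac_vote_intersection(px: np.ndarray, dirs: np.ndarray,
                             n_iters: int = 100, inlier_deg: float = 3.0):
    """px: (N, 2) pixel coordinates of one candidate area; dirs: (N, 2) unit
    first direction vectors. Returns (estimated landmark projection, outlier rate)."""
    cos_thr = np.cos(np.deg2rad(inlier_deg))
    best_pt, best_inliers = None, 0
    rng = np.random.default_rng(0)
    for _ in range(n_iters):
        i, j = rng.choice(len(px), size=2, replace=False)
        # intersect the two rays p + t*d by solving a small 2x2 linear system
        A = np.stack([dirs[i], -dirs[j]], axis=1)
        if abs(np.linalg.det(A)) < 1e-8:
            continue                                  # near-parallel rays, skip
        t = np.linalg.solve(A, px[j] - px[i])
        cand = px[i] + t[0] * dirs[i]                 # hypothesised landmark projection
        to_cand = cand - px                           # vote: does each pixel point at cand?
        to_cand /= np.linalg.norm(to_cand, axis=1, keepdims=True) + 1e-12
        inliers = int(((to_cand * dirs).sum(axis=1) > cos_thr).sum())
        if inliers > best_inliers:
            best_pt, best_inliers = cand, inliers
    if best_pt is None:
        return None, 1.0                              # degenerate area, fully outlying
    return best_pt, 1.0 - best_inliers / len(px)
```

The returned estimate plays the role of the initial position information, which the EM-like refinement mentioned above would then polish.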
Referring to Fig. 7, Fig. 8, Fig. 9, and Fig. 10: Fig. 7 is a schematic diagram of an embodiment of visual positioning using Scale-Invariant Feature Transform (SIFT) features, Fig. 8 is a schematic diagram of an embodiment of visual positioning using landmark points, Fig. 9 is a schematic diagram of an embodiment of the first landmark prediction image, and Fig. 10 is a schematic diagram of an embodiment of the first direction prediction image. Based on the first landmark prediction image of Fig. 9, a candidate area whose area is too small (corresponding to the trees in Fig. 8) can be filtered out as unstable, and based on the first direction prediction image of Fig. 10, a candidate area with poor consistency can be filtered out; the target landmark points are then obtained from the candidate areas remaining after filtering (the X marks in Fig. 8). In contrast, as shown in Fig. 7, visual positioning with SIFT features yields a huge number of feature points (the hollow circles in Fig. 7), including interference points in unstable areas such as trees; the excessive number of feature points sharply increases the computation of subsequent visual positioning, and the interference points easily degrade its accuracy and robustness.
In the above scheme, the landmark detection model processes the image to be positioned to obtain the first landmark prediction image and the first direction prediction image; the first landmark prediction image includes the predicted landmark attributes of the pixels in the image to be positioned, the first direction prediction image includes the first direction attributes of those pixels, the predicted landmark attribute is used to identify the landmark point corresponding to a pixel, the first direction attribute includes the first direction information pointing to the landmark projection, and the landmark projection represents the projected position, in the image to be positioned, of the landmark point corresponding to the pixel. On this basis, the first landmark prediction image and the first direction prediction image are analyzed to obtain the target landmark points. Since the first landmark prediction image includes the landmark point corresponding to each pixel, and the first direction prediction image includes each pixel's direction information pointing to the landmark projection, the influence of dynamic environments can be greatly reduced and positioning robustness improved.
Referring to Fig. 11, Fig. 11 is a schematic flowchart of an embodiment of training the landmark detection model, which may include the following steps:
Step S111: Respectively determine the projection areas of the sub-areas and the projection positions of the landmark points in a sample image.
The sample image is obtained by shooting the preset scene at a sample pose C. For each sub-area of the scene map, the sub-area can be projected onto the sample image using the aforementioned sample pose C and the camera intrinsic parameters K to obtain the projection area of the sub-area in the sample image; similarly, each landmark point can be projected onto the sample image using the sample pose C and the camera intrinsics K to obtain the projected position of the landmark point in the sample image.
Taking landmark point projection as an example, for a landmark point q_j among the several landmark points {q_1, q_2, ..., q_n}, its projection position l_j in the sample image can be obtained by the following formula (1):

l_j = f(q_j, K, C)   (1)

In formula (1), f represents the projection function, which can refer to the conversion among the world coordinate system, the camera coordinate system, the image coordinate system, and the pixel coordinate system.
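For illustration, a minimal pinhole version of the projection function f in formula (1) might look as follows, assuming the sample pose C is given as world-to-camera extrinsics (R, t) and lens distortion is ignored:

```python
import numpy as np

def project(q: np.ndarray, K: np.ndarray, R: np.ndarray, t: np.ndarray) -> np.ndarray:
    """Formula (1): l = f(q, K, C). q: (3,) landmark point in world coordinates;
    K: (3, 3) camera intrinsics; (R, t): sample pose C as world-to-camera extrinsics."""
    q_cam = R @ q + t                 # world -> camera coordinate system
    uvw = K @ q_cam                   # camera -> image plane (homogeneous)
    return uvw[:2] / uvw[2]           # homogeneous -> pixel coordinates (u, v)
```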
Step S112: Based on the projection areas and projection positions, determine the sample landmark attribute and the sample direction attribute of each sample pixel in the sample image.
The sample landmark attribute is used to identify the sample landmark point corresponding to a sample pixel, where the sample landmark point is the landmark point contained in the sub-area whose projection area covers the sample pixel; the sample direction attribute includes the sample direction information pointing to the projected position of the sample landmark point corresponding to the sample pixel.
For ease of description, taking pixel i in the sample image as an example, its position coordinates in the sample image can be denoted p_i = (u_i, v_i). If pixel i is covered by projection area j, where projection area j is the projection of sub-area j of the scene map in the sample image and sub-area j contains landmark point q_j, then the sample landmark attribute of pixel i identifies landmark point q_j; for example, it may include the landmark point label j of q_j among the several landmark points {q_1, q_2, ..., q_n}. Other cases can be deduced by analogy. If a pixel in the sample image is not covered by any projection area, the pixel can be considered to correspond to the sky or some distant object; in this case, the sample landmark attribute of the pixel is identified with a special mark (such as 0) unrelated to the landmark point labels of {q_1, q_2, ..., q_n}, indicating that the pixel has no effect on visual positioning.
As for the sample direction attribute, the sample direction information it contains may be a sample direction vector pointing to the projected position of the sample landmark point, and the sample direction vector may be a unit vector. Still taking pixel i as an example: as described above, the sample landmark point corresponding to pixel i is landmark point q_j, and the projected position l_j of q_j in the sample image is calculated by formula (1); the unit vector d_i can then be expressed as:

d_i = (l_j − p_i) / ||l_j − p_i||_2   (2)
Step S113: Obtain the sample landmark image and the sample direction image of the sample image based on the sample landmark attributes and the sample direction attributes, respectively.
The sizes of the sample landmark image and the sample direction image can be the same as the size of the sample image: each first pixel in the sample landmark image is marked with the sample landmark attribute of the corresponding sample pixel, and each second pixel in the sample direction image is marked with the sample direction attribute of the corresponding sample pixel. That is, the first pixel in row i, column j of the sample landmark image is marked with the sample landmark attribute of the sample pixel in row i, column j of the sample image, and the second pixel in row i, column j of the sample direction image is marked with the sample direction attribute of that same sample pixel.
When the sample landmark attribute includes a landmark point label, the sample landmark image can be denoted S ∈ Z^{H×W}; that is, the resolution of the sample landmark image S is H×W and every pixel value is an integer. Similarly, when the sample direction attribute is represented by a sample direction vector, the sample direction image can be denoted d ∈ R^{H×W×2}; that is, the resolution of the sample direction image d is H×W with 2 channels, every pixel value is a real number, and the pixel value in one channel represents one element of the sample direction vector while the pixel value in the other channel represents the other element.
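Putting steps S111 to S113 together, a hedged sketch of the label construction might look as follows. It assumes a precomputed (H, W) map giving, for every sample pixel, the index j of the projection area covering it (with 0 where no area covers the pixel); in practice this map would be obtained by rendering the sub-areas at the sample pose C:

```python
import numpy as np

def build_training_labels(region_of_pixel: np.ndarray, proj_pos: dict):
    """region_of_pixel: (H, W) int map, j where projection area j covers the pixel,
    0 where no area covers it (sky / distant objects).
    proj_pos: {j: (2,) projected position l_j of landmark q_j}, from formula (1).
    Returns the sample landmark image S (H, W) and sample direction image d (H, W, 2)."""
    H, W = region_of_pixel.shape
    S = region_of_pixel.copy()                    # landmark label j; special mark 0 kept
    d = np.zeros((H, W, 2), dtype=np.float64)
    vs, us = np.mgrid[0:H, 0:W]                   # pixel coordinates p_i = (u_i, v_i)
    for j, l_j in proj_pos.items():
        mask = S == j
        vec = l_j - np.stack([us[mask], vs[mask]], axis=1)       # formula (2), numerator
        d[mask] = vec / (np.linalg.norm(vec, axis=1, keepdims=True) + 1e-12)
    return S, d
```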
Step S114: Train the landmark detection model using the sample image, the sample landmark image, and the sample direction image.
In some embodiments, the landmark detection model can be used to predict on the sample image to obtain a second feature prediction image and a second direction prediction image of the sample image, where the second feature prediction image includes the second feature representation of each sample pixel, the second direction prediction image includes the second direction attribute of each sample pixel, the second direction attribute includes the second direction information pointing to the sample landmark projection, and the sample landmark projection represents the projected position of the sample landmark point in the sample image. On this basis, a first loss can be obtained from the sample landmark image and the second feature prediction image, a second loss can be obtained from the difference between the sample direction image and the second direction prediction image, and the network parameters of the landmark detection model are optimized based on the first loss and the second loss. Supervising the training of the landmark detection model with the pre-built sample landmark images and sample direction images in this way is beneficial to improving the detection performance of the landmark detection model.
Similar to the first direction information, the second direction information may include a second direction vector pointing to the sample landmark projection. When the detection performance of the landmark detection model is excellent, the predicted second direction vector may point accurately at the sample landmark projection; during training, however, the performance of the model improves only gradually and, limited by various factors, may never reach the ideal state (i.e., 100% accuracy). In this case, the predicted second direction vector may not point exactly at the sample landmark projection; for example, there may be a certain angular deviation (e.g., 1 degree, 2 degrees, 3 degrees, etc.) between the position it points to and the sample landmark projection.
As mentioned above, during the training of the landmark detection model a landmark feature representation set P can be maintained and updated, and P includes the feature representation to be optimized of each landmark point (e.g., the aforementioned {q_1, q_2, ..., q_n}). In some embodiments, at the first training iteration the feature representations to be optimized in P may be obtained by random initialization. For ease of description, the second feature prediction image can be denoted E, and the second feature representation of pixel i in the sample image can be denoted E_i.
To reduce the computational load and resource consumption of calculating the first loss, the image areas composed of sample pixels with the same sample landmark attribute can be obtained. For a sample pixel i in such an image area, the feature representation to be optimized of the sample landmark point identified by the sample landmark attribute is taken as the positive example feature representation P_{i+} of sample pixel i, and one reference feature representation is selected as the negative example feature representation P_{i−} of sample pixel i, where the reference feature representations include the feature representations to be optimized other than the positive example feature representation; that is, a feature representation to be optimized other than the positive one can be selected from the set P as a reference feature representation. On this basis, a sub-loss is obtained based on the first similarity between the second feature representation E_i and the positive example feature representation P_{i+} and the second similarity between E_i and the negative example feature representation P_{i−}, and the first loss is obtained from the sub-losses of the sample pixels in the sample image; for example, the sub-losses of all sample pixels may be summed to obtain the first loss.
In this way, on the one hand, minimizing the first loss makes the second feature representation as close as possible to its positive example feature representation and as far as possible from its negative example feature representation, improving the prediction performance of the landmark prediction network; on the other hand, selecting a single reference feature representation as the negative example avoids computing the loss between the second feature representation and all negative classes, greatly reducing computation and hardware consumption.
In some embodiments, the first similarity and the second similarity above can be processed with a triplet loss function to obtain the sub-losses, and the sub-losses of the sample pixels in the sample image are summed to obtain the first loss L_1, for example in the standard triplet form:

L_1 = Σ_{i: S_i ≠ 0} max( sim(E_i, P_{i−}) − sim(E_i, P_{i+}) + m, 0 )   (3)

In formula (3), m represents the metric margin of the triplet loss, and sim represents the cosine similarity function.
In other embodiments, before calculating the first similarity and the second similarity, the second feature representation of each sample pixel can be L2-normalized; on this basis, the first similarity between the normalized second feature representation and the positive example feature representation and the second similarity between the normalized second feature representation and the negative example feature representation are calculated.
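A minimal sketch of the per-pixel sub-loss of formula (3), with the L2 normalization described above (the margin value is an illustrative assumption):

```python
import numpy as np

def sub_loss(E_i: np.ndarray, P_pos: np.ndarray, P_neg: np.ndarray,
             m: float = 0.2) -> float:
    """Triplet-style sub-loss of one sample pixel, as in formula (3):
    pull E_i toward its positive landmark representation, push it from the negative."""
    E_i = E_i / np.linalg.norm(E_i)                          # L2-normalize the feature
    sim_pos = float(E_i @ P_pos / np.linalg.norm(P_pos))     # cosine similarity, positive
    sim_neg = float(E_i @ P_neg / np.linalg.norm(P_neg))     # cosine similarity, negative
    return max(sim_neg - sim_pos + m, 0.0)                   # hinge with margin m
```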
Referring to Fig. 12, Fig. 12 is a schematic diagram of an embodiment of calculating the first loss.
As shown by the dashed partition in Fig. 12, the sample image contains four image areas, each composed of sample pixels with the same sample landmark attribute. Taking the image area in the lower right corner as an example, the sample landmark points corresponding to the sample pixels in this area are all landmark point i+. The second feature representations of the sample pixels in the area can then be averaged to obtain the average feature representation M_{i+}, and based on the similarity between M_{i+} and each reference feature representation, several reference feature representations can be selected as the candidate feature representations of the image area. For example, the reference feature representations ranked in the top preset positions (such as the top k) by similarity, from high to low, can be selected as the candidate feature representations of the image area; in Fig. 12, these are the three feature representations to be optimized indicated by the curved arrows.
On this basis, when acquiring the negative example feature representation of each sample pixel in the image area, uniform sampling over the candidate feature representations yields the pixel's negative example feature representation. That is, since sample pixels in the same image area are spatially close to each other and should have similar feature representations, they can also share similar negative example feature representations; therefore, for each image area, only representative negative example feature representations need to be mined, and each sample pixel in the area only needs to sample from these representative negative example feature representations. For example, for sample pixel 1, sample pixel 2, sample pixel 3, and sample pixel 4 in the image area, the corresponding negative example feature representations can be sampled uniformly from the aforementioned three feature representations to be optimized; as shown, the feature representations to be optimized indicated by the bold arrows serve as their respective negative example feature representations.
The above manner, on the one hand, helps improve the reference value of the reference feature representations, and on the other hand, helps reduce the complexity of selecting a negative example feature representation for each sample pixel in an image area.
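The mining-and-sampling scheme of Fig. 12 can be sketched as follows; k and the random seed are illustrative assumptions:

```python
import numpy as np

def sample_negatives(E_area: np.ndarray, P: np.ndarray, pos_label: int,
                     k: int = 3, rng=np.random.default_rng(0)) -> np.ndarray:
    """E_area: (M, C) second feature representations of one image area.
    P: (n, C) feature representations to be optimized of all landmark points.
    Returns one negative example feature representation per sample pixel in the area."""
    mean_feat = E_area.mean(axis=0)                   # average feature representation M_{i+}
    sims = P @ mean_feat                              # similarity to every landmark
    sims[pos_label] = -np.inf                         # exclude the positive example
    candidates = np.argsort(sims)[-k:]                # top-k as candidate representations
    picks = rng.choice(candidates, size=len(E_area))  # uniform sampling per sample pixel
    return P[picks]
```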
As mentioned above, the second direction attribute includes the second direction information pointing to the sample landmark projection; for example, the second direction information may include a second direction vector pointing to the sample landmark projection. For ease of description, the second direction vector predicted at sample pixel i can be denoted d̂_i, and the sample direction vector marked at sample pixel i can be denoted d_i; the second loss L_2 can then be written, for example, as:

L_2 = Σ_i 1(S_i ≠ 0) · ||d̂_i − d_i||_2   (4)

In formula (4), 1 represents the indicator function, and the condition S_i ≠ 0 keeps only the sample pixels i marked with a corresponding sample landmark point in the sample landmark image S (i.e., it excludes the sample pixels marked with a special mark such as 0 that represents the sky or distant objects).
After the first loss and the second loss are obtained, they can be combined by weighted summation to obtain the total loss:

L = L_1 + λ · L_2   (5)

In formula (5), λ represents a weighting factor. On this basis, the network parameters of the landmark detection model and the feature representations to be optimized can be optimized based on the total loss.
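As an illustration of this joint optimization, the following self-contained PyTorch sketch drives both the network parameters and the landmark feature representation set P from one total-loss step; the tiny one-layer stand-in for the detection model and the simplified stand-ins for the first and second losses are assumptions for brevity, not the patent's networks:

```python
import torch

n_landmarks, feat_dim, lam = 128, 64, 1.0        # illustrative sizes; lam is formula (5)'s weight
model = torch.nn.Conv2d(3, feat_dim + 2, 1)      # stand-in for the landmark detection model
P = torch.nn.Parameter(torch.randn(n_landmarks, feat_dim))  # representations to optimize
opt = torch.optim.Adam(list(model.parameters()) + [P], lr=1e-4)

img = torch.randn(1, 3, 32, 32)                  # dummy sample image
S = torch.randint(0, n_landmarks, (1, 32, 32))   # dummy sample landmark image
d = torch.randn(1, 2, 32, 32)                    # dummy sample direction image

out = model(img)
E, d_pred = out[:, :feat_dim], out[:, feat_dim:] # second feature / direction predictions
# simplified stand-in for formula (3): pull each pixel's feature toward its landmark entry
target = P[S].permute(0, 3, 1, 2)                # (1, feat_dim, H, W)
loss1 = 1 - torch.cosine_similarity(E, target, dim=1).mean()
mask = (S != 0).unsqueeze(1)                     # formula (4): only labeled pixels count
loss2 = ((d_pred - d) ** 2 * mask).sum() / mask.sum().clamp(min=1)
loss = loss1 + lam * loss2                       # formula (5): total loss
opt.zero_grad(); loss.backward(); opt.step()     # updates both the network and P
```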
In the above scheme, the projection areas of the sub-areas and the projection positions of the landmark points in the sample image are determined first, and then, based on these, the sample landmark attribute and sample direction attribute of each sample pixel in the sample image are determined; the sample landmark attribute identifies the sample landmark point corresponding to the sample pixel, the sample landmark point is the landmark point contained in the sub-area whose projection area covers the sample pixel, and the sample direction attribute includes the sample direction information pointing to the projected position of that sample landmark point. On this basis, the sample landmark image and the sample direction image of the sample image are obtained, with each first pixel in the sample landmark image marked with the sample landmark attribute of the corresponding sample pixel and each second pixel in the sample direction image marked with the sample direction attribute of the corresponding sample pixel. Training samples can thus be constructed accurately, and the landmark detection model is then trained with the sample image, the sample landmark image, and the sample direction image, which in turn helps improve the detection performance of the landmark detection model.
Referring to Fig. 13, Fig. 13 is a schematic frame diagram of an embodiment of a visual positioning device 1300 of the present disclosure. The visual positioning device 1300 includes an information acquisition module 1310, a landmark detection module 1320, and a pose determination module 1330, wherein: the information acquisition module 1310 is configured to acquire an image to be positioned captured of a preset scene; the landmark detection module 1320 is configured to perform landmark detection on the image to be positioned to obtain target landmark points in the image to be positioned, where a target landmark point is at least one of several landmark points of the preset scene, the several landmark points are selected from the scene map of the preset scene, the scene map is obtained by three-dimensional modeling of the preset scene, and the several landmark points are respectively located at preset positions of the sub-areas of the scene map; and the pose determination module 1330 is configured to obtain the pose parameters of the image to be positioned based on the first position information of the target landmark points in the image to be positioned and the second position information of the target landmark points in the scene map.
In some disclosed embodiments, the several sub-areas are obtained by dividing the surface of the scene map; and/or, the preset position includes the center position of the sub-area; and/or, the area difference between the sub-areas is lower than a first threshold.
In some disclosed embodiments, the landmark detection module 1320 includes: an image processing submodule configured to process the image to be positioned with the landmark detection model to predict the first landmark prediction image and the first direction prediction image; and an image analysis submodule configured to analyze the first landmark prediction image and the first direction prediction image to obtain the target landmark points; wherein the first landmark prediction image includes the predicted landmark attributes of the pixels in the image to be positioned, the first direction prediction image includes the first direction attributes of the pixels in the image to be positioned, the predicted landmark attribute is used to identify the landmark point corresponding to a pixel, the first direction attribute includes the first direction information pointing to the landmark projection, and the landmark projection represents the projected position, in the image to be positioned, of the landmark point corresponding to the pixel.
In some disclosed embodiments, the image analysis submodule includes: a candidate area acquisition unit configured to acquire the candidate areas composed of pixels with the same predicted landmark attribute; a consistency statistics unit configured to count the consistency of the first direction attributes of the pixels in each candidate area; and a landmark determination unit configured to, when the consistency satisfies the preset condition, take the landmark point identified by the predicted landmark attributes of the pixels in the candidate area as the target landmark point, and obtain the first position information of the target landmark point in the image to be positioned based on the first direction attributes of the pixels in the candidate area.
In some disclosed embodiments, the image analysis submodule further includes: a candidate area filtering unit configured to filter a candidate area when the area of the candidate area is smaller than the second threshold.
In some disclosed embodiments, the first direction information includes the first direction vector, and the consistency statistics unit is further configured to acquire the intersection points of the first direction vectors of the pixels in the candidate area and count the outlier rate of the intersection points to obtain the consistency.
In some disclosed embodiments, the landmark detection model includes the feature extraction network, the landmark prediction network, and the direction prediction network, and the image processing submodule includes: a feature extraction unit configured to perform feature extraction on the image to be positioned with the feature extraction network to obtain the feature image; a landmark prediction unit configured to perform landmark prediction on the feature image with the landmark prediction network to obtain the first landmark prediction image; and a direction prediction unit configured to perform direction prediction on the feature image with the direction prediction network to obtain the first direction prediction image.
In some disclosed embodiments, the landmark prediction unit is further configured to decode the feature image with the landmark prediction network to obtain the first feature prediction image, where the first feature prediction image includes the first feature representation of each pixel in the image to be positioned; to obtain the predicted landmark attribute of a pixel based on the similarity between the first feature representation of the pixel and the landmark feature representation of each landmark point, where the landmark feature representations are obtained after the training of the landmark detection model converges; and to obtain the first landmark prediction image based on the predicted landmark attributes of the pixels in the image to be positioned.
In some disclosed embodiments, the target landmark points are detected with a landmark detection model, and the visual positioning device 1300 further includes: a projection acquisition module configured to respectively determine the projection areas of the sub-areas and the projection positions of the landmark points in the sample image; an attribute determination module configured to determine the sample landmark attribute and sample direction attribute of each sample pixel in the sample image based on the projection areas and projection positions, where the sample landmark attribute is used to identify the sample landmark point corresponding to the sample pixel, the sample landmark point is the landmark point contained in the sub-area whose projection area covers the sample pixel, and the sample direction attribute includes the sample direction information pointing to the projected position of the sample landmark point corresponding to the sample pixel; a sample acquisition module configured to obtain the sample landmark image and the sample direction image of the sample image based on the sample landmark attributes and sample direction attributes respectively, where each first pixel in the sample landmark image is marked with the sample landmark attribute of the corresponding sample pixel and each second pixel in the sample direction image is marked with the sample direction attribute of the corresponding sample pixel; and a model training module configured to train the landmark detection model with the sample image, the sample landmark image, and the sample direction image.
In some disclosed embodiments, the model training module includes: an image prediction submodule configured to predict on the sample image with the landmark detection model to obtain the second feature prediction image and the second direction prediction image of the sample image, where the second feature prediction image includes the second feature representation of each sample pixel, the second direction prediction image includes the second direction attribute of each sample pixel, the second direction attribute includes the second direction information pointing to the sample landmark projection, and the sample landmark projection represents the projected position of the sample landmark point in the sample image; a loss calculation submodule configured to obtain the first loss based on the sample landmark image and the second feature prediction image, and obtain the second loss from the difference between the sample direction image and the second direction prediction image; and a parameter optimization submodule configured to optimize the network parameters of the landmark detection model based on the first loss and the second loss.
In some disclosed embodiments, the loss calculation submodule includes: an image area and feature representation acquisition unit configured to acquire the image areas composed of sample pixels with the same sample landmark attribute and to acquire the feature representation to be optimized of each landmark point; a sub-loss calculation unit configured to, for a sample pixel in an image area, take the feature representation to be optimized of the sample landmark point identified by the sample landmark attribute as the positive example feature representation of the sample pixel, select one reference feature representation as the negative example feature representation of the sample pixel, and obtain a sub-loss based on the first similarity between the second feature representation and the positive example feature representation and the second similarity between the second feature representation and the negative example feature representation, where the reference feature representations include the feature representations to be optimized other than the positive example feature representation; and a loss statistics unit configured to obtain the first loss based on the sub-losses of the sample pixels in the sample image.
In some disclosed embodiments, the sub-loss calculation unit is further configured to average the second feature representations of the sample pixels in the image area to obtain the average feature representation; to select, based on the similarity between the average feature representation and each reference feature representation, several reference feature representations as the candidate feature representations of the image area; and to sample uniformly from the candidate feature representations to obtain the negative example feature representations of the sample pixels.
In some disclosed embodiments, the parameter optimization submodule is further configured to optimize the feature representations to be optimized of the landmark points and the network parameters of the landmark detection model based on the first loss and the second loss.
Referring to Fig. 14, Fig. 14 is a schematic frame diagram of an embodiment of an electronic device 140 of the present disclosure. The electronic device 140 includes a memory 141 and a processor 142 coupled to each other, and the processor 142 is configured to execute program instructions stored in the memory 141 to implement the steps of any of the above visual positioning method embodiments.
In some embodiments, the electronic device 140 may include, but is not limited to, a microcomputer and a server; in addition, the electronic device 140 may also include mobile devices such as notebook computers and tablet computers, which is not limited here.
The processor 142 is configured to control itself and the memory 141 to implement the steps of any of the above visual positioning method embodiments.
The processor 142 may also be called a central processing unit (CPU). The processor 142 may be an integrated circuit chip with signal processing capability. The processor 142 may also be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. In addition, the processor 142 may be jointly implemented by multiple integrated circuit chips.
The foregoing solution can improve the accuracy and robustness of visual positioning.
Referring to Fig. 15, Fig. 15 is a schematic diagram of an embodiment of a computer-readable storage medium 150 of the present disclosure. The computer-readable storage medium 150 stores program instructions 151 executable by a processor, and the program instructions 151 are used to implement the steps of any of the above visual positioning method embodiments.
The foregoing solution can improve the accuracy and robustness of visual positioning.
The disclosed embodiments also provide a computer program including computer-readable code; when the computer-readable code runs in an electronic device, a processor of the electronic device executes the visual positioning method described in any of the above embodiments.
In the several embodiments provided by the present disclosure, it should be understood that the disclosed methods and devices may be implemented in other ways. The device implementations described above are only illustrative.
For example, the division into modules or units is only a division by logical function; in actual implementation there may be other division methods, for example, units or components may be combined or integrated into another system, or some features may be ignored or not implemented.
In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be implemented through some interfaces, and the indirect coupling or communication connection between devices or units may be electrical, mechanical, or in other forms.
A unit described as a separate component may or may not be physically separate, and a component shown as a unit may or may not be a physical unit; that is, it may be located in one place or distributed over several network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiments.
In addition, each functional unit in the embodiments of the present disclosure may be integrated into one processing unit, or each unit may exist physically separately, or two or more units may be integrated into one unit. The above integrated unit can be implemented in the form of hardware or in the form of a software functional unit.
If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present disclosure, in essence, or the part contributing to the prior art, or all or part of the technical solution, can be embodied in the form of a software product; the computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) or a processor to execute all or part of the steps of the methods in the various embodiments of the present disclosure. The aforementioned storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.


Abstract

A visual positioning method, device, equipment, medium, and program, wherein the visual positioning method includes: acquiring an image to be positioned captured of a preset scene (S11); performing landmark detection on the image to be positioned to obtain target landmark points in the image to be positioned (S12), wherein a target landmark point is at least one of several landmark points of the preset scene, the several landmark points are selected from a scene map of the preset scene, the scene map is obtained by three-dimensional modeling of the preset scene, and the several landmark points are respectively located at preset positions of the sub-areas of the scene map; and obtaining the pose parameters of the image to be positioned based on the first position information of the target landmark points in the image to be positioned and the second position information of the target landmark points in the scene map (S13). The above solution can improve the accuracy and robustness of visual positioning.

Description

Cross-Reference to Related Applications

This patent application claims priority to Chinese Patent Application No. 202110564566.7, filed on May 24, 2021 by the applicant Zhejiang SenseTime Technology Development Co., Ltd. under the application title "Visual positioning method and related device and equipment", the entire content of which is incorporated into this application by reference.
Summary
In some embodiments of the present disclosure, the several sub-areas are obtained by dividing the surface of the scene map; and/or, the preset position includes the center position of the sub-area; and/or, the area difference between the sub-areas is lower than a first threshold. Since the image to be positioned is usually an image of the surface of the preset scene, obtaining the sub-areas by dividing the surface of the scene map helps improve the accuracy of the target landmark points detected in the image to be positioned; setting the preset position to include the center position of the sub-area helps make the landmark points evenly distributed, which helps improve point-pair quality; and setting the area difference between the sub-areas to be lower than the first threshold likewise helps make the landmark points evenly distributed and improve point-pair quality.

In some embodiments of the present disclosure, performing landmark detection on the image to be positioned to obtain the target landmark points includes: processing the image to be positioned with a landmark detection model to predict a first landmark prediction image and a first direction prediction image; and analyzing the first landmark prediction image and the first direction prediction image to obtain the target landmark points, where the first landmark prediction image includes the predicted landmark attributes of the pixels in the image to be positioned, the first direction prediction image includes the first direction attributes of those pixels, the predicted landmark attribute is used to identify the landmark point corresponding to a pixel, the first direction attribute includes first direction information pointing to the landmark projection, and the landmark projection represents the projected position, in the image to be positioned, of the landmark point corresponding to the pixel. Since the first landmark prediction image includes the landmark point corresponding to each pixel and the first direction prediction image includes each pixel's direction information pointing to the landmark projection, the influence of dynamic environments can be greatly reduced and positioning robustness improved.

In some embodiments of the present disclosure, analyzing the first landmark prediction image and the first direction prediction image to obtain the target landmark points includes: acquiring the candidate areas composed of pixels with the same predicted landmark attribute; counting the consistency of the first direction attributes of the pixels in each candidate area; and, when the consistency satisfies a preset condition, taking the landmark point identified by the predicted landmark attributes of the pixels in the candidate area as a target landmark point and obtaining the first position information of the target landmark point in the image to be positioned based on the first direction attributes of the pixels in the candidate area. That is, before the target landmark point is determined, the consistency of the first direction attributes within the candidate area is checked first, which helps ensure that consistency, improves the quality of the subsequently constructed point pairs, and in turn helps improve the accuracy and robustness of visual positioning.

In some embodiments of the present disclosure, before counting the consistency of the first direction attributes of the pixels in a candidate area, the method further includes: filtering the candidate area when the area of the candidate area is smaller than a second threshold. Pre-filtering candidate areas whose area is too small helps filter out unstable areas in advance, improves the quality of the subsequently constructed point pairs, and in turn helps improve the accuracy and robustness of visual positioning.

In some embodiments of the present disclosure, the first direction information includes a first direction vector, and counting the consistency of the first direction attributes of the pixels in a candidate area includes: acquiring the intersection points of the first direction vectors of the pixels in the candidate area, and counting the outlier rate of the intersection points to obtain the consistency. The consistency thus effectively reflects the overall prediction quality of the first direction attributes of the pixels in the candidate area, which helps improve the quality of the subsequently constructed point pairs and in turn the accuracy and robustness of visual positioning.

In some embodiments of the present disclosure, the landmark detection model includes a feature extraction network, a landmark prediction network, and a direction prediction network, and processing the image to be positioned with the landmark detection model includes: performing feature extraction on the image to be positioned with the feature extraction network to obtain a feature image; performing landmark prediction on the feature image with the landmark prediction network to obtain the first landmark prediction image; and performing direction prediction on the feature image with the direction prediction network to obtain the first direction prediction image. The landmark prediction network and the direction prediction network are thus respectively responsible for predicting landmarks and directions while sharing the feature image extracted by the feature extraction network, which helps improve prediction efficiency.

In some embodiments of the present disclosure, performing landmark prediction on the feature image with the landmark prediction network includes: decoding the feature image with the landmark prediction network to obtain a first feature prediction image, where the first feature prediction image includes the first feature representation of each pixel in the image to be positioned; obtaining the predicted landmark attribute of a pixel based on the similarity between the first feature representation of the pixel and the landmark feature representation of each landmark point, where the landmark feature representations are obtained after the training of the landmark detection model converges; and obtaining the first landmark prediction image based on the predicted landmark attributes of the pixels in the image to be positioned. Since the landmark feature representations obtained after training convergence can accurately characterize the landmark point features, deriving a pixel's predicted landmark attribute from the similarities between its first feature representation and the landmark feature representations helps improve the accuracy of the predicted landmark attributes.

In some embodiments of the present disclosure, the target landmark points are detected with a landmark detection model, and the training of the landmark detection model includes: respectively determining the projection areas of the sub-areas and the projection positions of the landmark points in a sample image; determining, based on the projection areas and projection positions, the sample landmark attribute and sample direction attribute of each sample pixel in the sample image, where the sample landmark attribute is used to identify the sample landmark point corresponding to the sample pixel, the sample landmark point is the landmark point contained in the sub-area whose projection area covers the sample pixel, and the sample direction attribute includes sample direction information pointing to the projected position of the sample landmark point corresponding to the sample pixel; obtaining, based on the sample landmark attributes and sample direction attributes respectively, the sample landmark image and the sample direction image of the sample image, where each first pixel in the sample landmark image is marked with the sample landmark attribute of the corresponding sample pixel and each second pixel in the sample direction image is marked with the sample direction attribute of the corresponding sample pixel; and training the landmark detection model with the sample image, the sample landmark image, and the sample direction image. Training samples can thus be constructed accurately, which in turn helps improve the detection performance of the landmark detection model.

In some embodiments of the present disclosure, training the landmark detection model with the sample image, the sample landmark image, and the sample direction image includes: predicting on the sample image with the landmark detection model to obtain a second feature prediction image and a second direction prediction image of the sample image, where the second feature prediction image includes the second feature representation of each sample pixel, the second direction prediction image includes the second direction attribute of each sample pixel, the second direction attribute includes second direction information pointing to the sample landmark projection, and the sample landmark projection represents the projected position of the sample landmark point in the sample image; obtaining a first loss based on the sample landmark image and the second feature prediction image, and obtaining a second loss from the difference between the sample direction image and the second direction prediction image; and optimizing the network parameters of the landmark detection model based on the first loss and the second loss. The training of the landmark detection model is thus supervised by the pre-built sample landmark images and sample direction images, which helps improve its detection performance.

In some embodiments of the present disclosure, obtaining the first loss based on the sample landmark image and the second feature prediction image includes: acquiring the image areas composed of sample pixels with the same sample landmark attribute, and acquiring the feature representation to be optimized of each landmark point; for a sample pixel in an image area, taking the feature representation to be optimized of the sample landmark point identified by the sample landmark attribute as the positive example feature representation of the sample pixel, selecting one reference feature representation as the negative example feature representation of the sample pixel, and obtaining a sub-loss based on the first similarity between the second feature representation and the positive example feature representation and the second similarity between the second feature representation and the negative example feature representation, where the reference feature representations include the feature representations to be optimized other than the positive example feature representation; and obtaining the first loss based on the sub-losses of the sample pixels in the sample image. On the one hand, minimizing the first loss makes the second feature representation as close as possible to its positive example feature representation and as far as possible from its negative example feature representation, improving the prediction performance of the landmark prediction network; on the other hand, selecting a single reference feature representation as the negative example avoids computing the loss between the second feature representation and all negative classes, greatly reducing computation and hardware consumption.

In some embodiments of the present disclosure, selecting one reference feature representation as the negative example feature representation of a sample pixel includes: computing the average feature representation of the second feature representations of the sample pixels in the image area; selecting, based on the similarity between the average feature representation and each reference feature representation, several reference feature representations as the candidate feature representations of the image area; and sampling uniformly from the candidate feature representations to obtain the negative example feature representation of the sample pixel. Since the average feature representation can characterize the image area as a whole, selecting the candidate reference feature representations through it and then sampling uniformly helps, on the one hand, improve the reference value of the reference feature representations and, on the other hand, reduce the complexity of selecting a negative example feature representation for every sample pixel in the image area.

In some embodiments of the present disclosure, optimizing the network parameters of the landmark detection model based on the first loss and the second loss includes: optimizing, based on the first loss and the second loss, the feature representations to be optimized of the landmark points and the network parameters of the landmark detection model. Optimizing both simultaneously during training helps improve the accuracy and robustness of landmark detection, which helps improve point-pair quality and in turn the accuracy and robustness of visual positioning.
In the visual positioning method, device, equipment, medium, and program provided by the embodiments of the present disclosure, an image to be positioned captured of a preset scene is acquired, landmark detection is performed on it to obtain the target landmark points, and the pose parameters of the image to be positioned are then obtained based on the first position information of the target landmark points in the image to be positioned and the second position information of the target landmark points in the scene map. Since the several landmark points are respectively located at the preset positions of the sub-areas of the scene map, the landmark points are not disordered but evenly distributed; the target landmark points detected in the image to be positioned are at least one of the several landmark points, and the subsequent visual positioning relies only on the point pairs formed by the two-dimensional positions of the target landmark points in the image to be positioned and their three-dimensional positions in the scene map, no longer relying on other points unrelated to the landmark points. The number of point pairs can thus be reduced while the point-pair quality is improved, which in turn helps improve the accuracy and robustness of visual positioning.

To make the above objects, features, and advantages of the present disclosure more comprehensible, preferred embodiments are described in detail below in conjunction with the accompanying drawings.
Brief Description of the Drawings
To describe the technical solutions of the embodiments of the present disclosure more clearly, the drawings needed in the embodiments are briefly introduced below; the drawings here are incorporated into and constitute a part of the specification, illustrate embodiments consistent with the present disclosure, and together with the specification serve to explain the technical solutions of the embodiments. It should be understood that the following drawings show only some embodiments of the present disclosure and should not be regarded as limiting its scope; those of ordinary skill in the art can obtain other related drawings from these drawings without creative effort.
Fig. 1 is a schematic flowchart of an embodiment of the visual positioning method of the present disclosure;
Fig. 2 is a schematic diagram of an embodiment of the scene map;
Fig. 3 is a schematic diagram of an embodiment of detecting target landmark points with the landmark detection model;
Fig. 4 is a schematic diagram of an embodiment of locating target landmark points;
Fig. 5 is a schematic diagram of a system architecture to which the visual positioning method of the embodiments of the present disclosure can be applied;
Fig. 6 is a schematic flowchart of an embodiment of step S12 in Fig. 1;
Fig. 7 is a schematic diagram of an embodiment of visual positioning using SIFT features;
Fig. 8 is a schematic diagram of an embodiment of visual positioning using landmark points;
Fig. 9 is a schematic diagram of an embodiment of the first landmark prediction image;
Fig. 10 is a schematic diagram of an embodiment of the first direction prediction image;
Fig. 11 is a schematic flowchart of an embodiment of training the landmark detection model;
Fig. 12 is a schematic diagram of an embodiment of calculating the first loss;
Fig. 13 is a schematic frame diagram of an embodiment of the visual positioning device of the present disclosure;
Fig. 14 is a schematic frame diagram of an embodiment of the electronic device of the present disclosure;
Fig. 15 is a schematic frame diagram of an embodiment of the computer-readable storage medium of the present disclosure.
Detailed Description of the Embodiments
The solutions of the embodiments of the present disclosure are described in detail below with reference to the accompanying drawings.

In the following description, specific details such as particular system structures, interfaces, and technologies are set forth for the purpose of illustration rather than limitation, so as to provide a thorough understanding of the present disclosure.

The terms "system" and "network" are often used interchangeably herein. The term "and/or" herein merely describes an association relationship between associated objects and indicates that three relationships may exist; for example, A and/or B may mean: A alone, both A and B, or B alone. In addition, the character "/" herein generally indicates an "or" relationship between the preceding and following associated objects. Furthermore, "multiple" herein means two or more than two.
Referring to Fig. 1, Fig. 1 is a schematic flowchart of an embodiment of the visual positioning method of the present disclosure, which may include the following steps:

Step S11: Acquire an image to be positioned captured of a preset scene.

In one implementation scenario, the preset scene can be set according to actual application needs. For example, when visual positioning is needed in a scenic area, the preset scene may include the scenic area; when visual positioning is needed in a commercial street, the preset scene may include the commercial street; and when visual positioning is needed in an industrial park, the preset scene may include the industrial park. Other cases can be deduced by analogy and are not enumerated here.

In one implementation scenario, the image to be positioned may be captured of the preset scene from any viewing angle; for example, it may be captured looking upward, looking downward, or looking horizontally at the preset scene.

In another implementation scenario, to improve the accuracy of visual positioning, the angle between the optical axis of the camera and the horizontal plane when shooting the preset scene should be lower than a preset angle threshold; that is, the image to be positioned should contain as much of the preset scene as possible and as few invalid areas such as the ground and the sky as possible.

Step S12: Perform landmark detection on the image to be positioned to obtain target landmark points in the image to be positioned.

In some embodiments of the present disclosure, a target landmark point is at least one of several landmark points of the preset scene, the several landmark points are selected from a scene map of the preset scene, the scene map is obtained by three-dimensional modeling of the preset scene, and the several landmark points are respectively located at preset positions of the sub-areas of the scene map.

In one implementation scenario, a video of the preset scene can be captured in advance and processed with a three-dimensional reconstruction algorithm to obtain the scene map of the preset scene. Three-dimensional reconstruction algorithms may include, but are not limited to, Multi-View Stereo, KinectFusion, and the like, which are not limited here; for the implementation of a three-dimensional reconstruction algorithm, refer to its technical details.

In one implementation scenario, the several sub-areas are obtained by dividing the surface of the scene map. In some embodiments of the present disclosure, the surface of the scene map can be divided into several sub-areas by a three-dimensional over-segmentation algorithm (e.g., supervoxel segmentation). Referring to Fig. 2, Fig. 2 is a schematic diagram of an embodiment of the scene map; as shown in Fig. 2, areas of different gray levels represent different sub-areas of the scene map surface.

In one implementation scenario, the preset position may include the center position of the sub-area. Continuing to refer to Fig. 2, the black dot in a sub-area represents the landmark point determined in that sub-area.

In one implementation scenario, the area difference between the sub-areas may be lower than a first threshold, which can be set according to the actual situation, for example, 10 pixels, 15 pixels, 20 pixels, and so on, without limitation here. That is, the sub-areas have similar sizes.

In the above manner, the surface of the scene map is evenly divided into several sub-areas and the landmark points are selected at the center positions of these sub-areas, so the landmark points are evenly distributed over the scene map surface; no matter from which viewing angle the image to be positioned is captured of the preset scene, it contains enough landmark points, which helps improve the robustness of visual positioning; a sketch of this selection follows below.
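As an illustration of the landmark selection just described, the following minimal sketch assumes the scene map surface is available as points with per-sub-area labels produced by the over-segmentation (the function name and the nearest-point snap are illustrative assumptions):

```python
import numpy as np

def select_landmarks(vertices: np.ndarray, region_labels: np.ndarray) -> dict:
    """Pick one landmark per sub-area: the surface point closest to the
    sub-area's centroid, as a simple stand-in for 'the center position'."""
    landmarks = {}
    for region_id in np.unique(region_labels):
        pts = vertices[region_labels == region_id]   # (M, 3) points of one sub-area
        centroid = pts.mean(axis=0)                  # geometric center of the sub-area
        nearest = pts[np.argmin(np.linalg.norm(pts - centroid, axis=1))]
        landmarks[int(region_id)] = nearest          # snap to an actual surface point
    return landmarks
```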
In one implementation scenario, to improve the efficiency and accuracy of landmark detection, a landmark detection model can be trained in advance, so that the image to be positioned can be detected and analyzed with the landmark detection model to obtain the target landmark points. For ease of description, the several landmark points of the preset scene can be denoted {q_1, q_2, ..., q_n}, and a target landmark point may be at least one of {q_1, q_2, ..., q_n}.

In another implementation scenario, to improve the efficiency and accuracy of landmark detection, after processing the image to be positioned with the landmark detection model, a first landmark prediction image and a first direction prediction image can be obtained; the first landmark prediction image includes the predicted landmark attributes of the pixels in the image to be positioned, the first direction prediction image includes the first direction attributes of those pixels, the predicted landmark attribute is used to identify the landmark point corresponding to a pixel, the first direction attribute includes first direction information pointing to the landmark projection, and the landmark projection represents the projected position, in the image to be positioned, of the landmark point corresponding to the pixel. On this basis, the first landmark prediction image and the first direction prediction image are analyzed to obtain the target landmark points; for the training process of the landmark detection model, refer to the related disclosed embodiments below. Since the first landmark prediction image includes the landmark point corresponding to each pixel and the first direction prediction image includes each pixel's direction information pointing to the landmark projection, the influence of dynamic environments can be greatly reduced and positioning robustness improved.

In one implementation scenario, referring to Fig. 3, Fig. 3 is a schematic diagram of an embodiment of detecting target landmark points with the landmark detection model. As shown in Fig. 3, the landmark detection model may include a feature extraction network, a landmark prediction network, and a direction prediction network: the feature extraction network performs feature extraction on the image to be positioned to obtain a feature image, the landmark prediction network performs landmark prediction on the feature image to obtain the first landmark prediction image, and the direction prediction network performs direction prediction on the feature image to obtain the first direction prediction image. That is, the landmark prediction network and the direction prediction network are respectively responsible for predicting landmarks and directions while sharing the feature image extracted by the feature extraction network, which helps improve prediction efficiency.

In another implementation scenario, continuing to refer to Fig. 3, pixels with the same predicted landmark attribute are displayed in the same gray level for ease of description; that is, in the first landmark prediction image shown in Fig. 3, pixels displayed in the same gray level correspond to the same landmark point (e.g., a certain landmark point among the aforementioned {q_1, q_2, ..., q_n}). Likewise, in the first direction prediction image, the direction prediction attributes of the pixels can be represented by different gray levels; as in the example of Fig. 3, the 0-degree, 45-degree, 90-degree, 135-degree, 180-degree, 225-degree, 270-degree, and 315-degree directions are each shown in a different gray level. It should be noted that the first landmark prediction image and first direction prediction image shown in Fig. 3 are only one possible representation in practical application: representing the predicted landmark and direction attributes by different gray levels visualizes the model's predictions, while in practice the outputs of the landmark prediction network and the direction prediction network can also be represented directly by numbers, which is not limited here.

In yet another implementation scenario, referring to Fig. 4, Fig. 4 is a schematic diagram of an embodiment of locating target landmark points. As shown in Fig. 4, the hollow circles represent the target landmark points located in the image to be positioned, and the rectangular box area at the lower right is an enlarged view of the rectangular box area at the upper left; in it, pixels of the same gray level have the same predicted landmark attribute, and the direction arrows represent the predicted direction attributes of the pixels. Based on this common predicted landmark attribute, the target landmark point identified by it (e.g., a certain landmark point among {q_1, q_2, ..., q_n}) can be determined, and based on the predicted direction attributes of these pixels, the position information of the target landmark point in the image to be positioned can be determined (e.g., the position shown by the solid circle in the figure), for example by determining the intersection point of the direction arrows shown in Fig. 4; for the related implementation, refer to the description in the disclosed embodiments above.

In yet another implementation scenario, both the first landmark prediction image and the first direction prediction image may have the same size as the image to be positioned; alternatively, at least one of them may differ in size from the image to be positioned.

In yet another implementation scenario, DeepLabV3 can be used as the backbone network of the landmark detection model, which significantly enlarges the receptive field through spatial pyramid pooling.

Step S13: Obtain the pose parameters of the image to be positioned based on the first position information of the target landmark points in the image to be positioned and the second position information of the target landmark points in the scene map.

In some embodiments of the present disclosure, the first position information of a target landmark point in the image to be positioned may be two-dimensional coordinates, and its second position information in the scene map may be three-dimensional coordinates. In addition, as mentioned above, the landmark points are selected from the scene map of the preset scene, and the scene map is obtained by three-dimensional modeling of the preset scene, so the second position information of a landmark point in the scene map can be determined directly from the scene map. On this basis, the landmark point whose label corresponds to that of the target landmark point can be determined among the several landmark points based on the label of the target landmark point and the labels of the several landmark points in the scene map, and the second position information of that corresponding landmark point is taken as the second position information of the target landmark point. Referring to Fig. 4, after several target landmark points (the hollow circles in the figure) are detected, several 2D-3D point pairs can be established from the first position information of the target landmark points in the image to be positioned and their second position information in the scene map, and the pose parameters of the image to be positioned (e.g., 6-degree-of-freedom parameters) can be recovered from these point pairs. In some embodiments of the present disclosure, the pose parameters can be obtained with a PnP algorithm based on Random Sample Consensus (RANSAC); for the related algorithm steps, refer to the technical details of RANSAC PnP, which are not repeated here.
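The disclosure defers to the standard RANSAC PnP solver for this step; a minimal sketch using OpenCV's solvePnPRansac is given below (the reprojection threshold and iteration count are illustrative assumptions):

```python
import numpy as np
import cv2

def recover_pose(points_2d: np.ndarray, points_3d: np.ndarray, K: np.ndarray):
    """points_2d: (N, 2) first position information of the target landmarks in the image.
    points_3d: (N, 3) second position information of the same landmarks in the scene map.
    K: (3, 3) camera intrinsic matrix; lens distortion is assumed zero here."""
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        points_3d.astype(np.float64),
        points_2d.astype(np.float64),
        K.astype(np.float64),
        distCoeffs=None,
        reprojectionError=4.0,   # pixel threshold for RANSAC inliers
        iterationsCount=1000,
    )
    if not ok:
        raise RuntimeError("PnP failed: too few consistent landmark point pairs")
    R, _ = cv2.Rodrigues(rvec)   # rotation vector -> 3x3 rotation matrix
    return R, tvec, inliers      # camera pose (world -> camera) and inlier indices
```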
In the above solution, an image to be positioned captured of the preset scene is acquired, landmark detection is performed on it to obtain the target landmark points, and the pose parameters of the image to be positioned are obtained based on the first position information of the target landmark points in the image to be positioned and the second position information of the target landmark points in the scene map. As analyzed above, the number of point pairs can thereby be reduced while the point-pair quality is improved, which in turn helps improve the accuracy and robustness of visual positioning.

Fig. 5 shows a schematic diagram of a system architecture to which the visual positioning method of the embodiments of the present disclosure can be applied. As shown in Fig. 5, the system architecture includes an image acquisition terminal 501, a network 502, and a pose parameter determination terminal 503. To support an exemplary application, the image acquisition terminal 501 and the pose parameter determination terminal 503 establish a communication connection through the network 502; the image acquisition terminal 501 reports the image to be positioned to the pose parameter determination terminal 503 through the network 502; the pose parameter determination terminal 503 performs landmark detection on the image to be positioned to obtain the target landmark points, and obtains the pose parameters of the image to be positioned based on the first position information of the target landmark points in the image to be positioned and their second position information in the scene map; finally, the pose parameter determination terminal 503 uploads the pose parameters to the network 502 and sends them to the image acquisition terminal 501.

As an example, the image acquisition terminal 501 may include an image capture device, and the pose parameter determination terminal 503 may include a vision processing device with visual information processing capability or a remote server. The network 502 may use a wired or wireless connection. When the pose parameter determination terminal 503 is a vision processing device, the image acquisition terminal 501 may communicate with the vision processing device through a wired connection, such as data communication over a bus; when the pose parameter determination terminal 503 is a remote server, the image acquisition terminal 501 may exchange data with the remote server through a wireless network.

Alternatively, in some scenarios, the image acquisition terminal 501 may be a vision processing device with a video capture module, or a host with a camera. In this case, the visual positioning method of the embodiments of the present disclosure may be executed by the image acquisition terminal 501, and the above system architecture may not include the network 502 and the pose parameter determination terminal 503.
请参阅图6,图6是图1中步骤S12一实施例的流程示意图。如图6所示,可以包括如下步骤:
步骤S61:利用地标检测模型处理待定位图像,预测得到第一地标预测图像和第一方向预测图像。
本公开的一些实施例中,第一地标预测图像包括待定位图像中像素点的预测地标属性,第一方向预测图像包括待定位图像中像素点的第一方向属性,预测地标属性用于标识像素点对应的地标点,第一方向属性包括指向地标投影的第一方向信息,地标投影表示像素点对应的地标点在待定位图像中的投影位置。此外,第一地标预测图像和第一方向预测图像两者可以与待定位图像尺寸相同,或者,第一地标预测图像和第一方向预测图像至少一者可以与待定位图像尺寸不同,可以参阅前述公开实施例中相关描述。
在一个实施场景中,如前述公开实施例所述,若干地标点可以记为{q 1,q 2,…,q n},则预测地标属性可以包括像素点对应的地标点的标号,即在预测地标属性包括i的情况下,像素点对应的地标点为q i
在一个实施场景中,第一方向信息可以包括第一方向矢量,该第一方向矢量指向地标投影。本公开的一些实施例中,在地标检测模型的检测性能极佳的情况下,地标检测模型所预测出来的第一方向矢量可能准确地指向地标投影。在实际应用过程中,地标检测模型的检测性能受限于各种因素可能无法达到极佳,在此情况下,地标检测模型所预测出来的第一方向矢量可能并非准确指向地标投影,如第一方向矢量所指向的位置与地标投影之间可以存在一定的角度偏差(如,1度、2度、3度等),由于待定位图像中各个像素点均能够预测得到一个第一方向矢量,故通过多个像素点的第一方向矢量,能够修正单个第一方向矢量可能存在的方向偏差,其过程可以参阅下述相关描述。
在一个实施场景中,如前述公开实施例所述,地标检测模型可以包括特征提取网络、地标预测网络和方向预测网络,则可以利用特征提取网络对待定位图像进行特征提取,得到特征图像,并利用地标预测网络对特征图像进行地标预测,得到第一地标预测图像,以及利用方向预测网络对特征图像进行方向预测,得到第一方向预测图像。也就是说,地标预测网络和方向预测网络可以共享特征提取网络所提取得到的特征图像,其可以参阅前述公开实施例相关描述。
在一个实施场景中,如前所述,第一方向信息可以包括第一方向矢量,该第一方向矢量可以为一个模值为1的单位矢量。
在另一个实施场景中,利用地标预测网络可以对特征图像进行解码,得到第一特征预测图像,且第一特征预测图像包括待定位图像中像素点的第一特征表示。在此基础上,可以基于像素点的第一特征表示分别与各个地标点的地标特征表示之间的相似度,得到像素点的预测地标属性,且地标特征表示是在地标检测模型训练收敛之后得到的,并基于待定位图像中各个像素点的预测地标属性,得到第一地标预测图像。本公开的一些实施例中,在地标检测模型的训练过程中,可以维护并更新一个地标特征表示集合P,该地标特征表示集合P包含各个地标点(如,前述{q 1,q 2,…,q n})的待优化特征表示,在地标检测模型训练收敛之后,即可学习到预设场景各个地标点的特征信息,这些特征信息即反映于各个地标点收敛之后的待优化特征表示中。为了便于区分,可以将训练收敛的 待优化特征表示称之为地标特征表示。地标检测模型的训练过程,其可以参阅下述公开实施例。
此外,对于每一像素点,可以计算像素点的第一特征表示分别与各个地标点(如,前述{q 1,q 2,…,q n})的地标特征表示之间的相似度,并选择最高相似度对应的地标点,作为像素点对应的地标点,从而可以采用该地标点标识像素点,得到像素点的预测地标属性。例如,可以计算像素点的第一特征表示分别与各个地标点的地标特征表示之间的内积,并选取最小内积对应的地标点在预设场景的若干地标点中的标号(如,1、2、……、n等)来标识该地标点,以得到预测地标属性。在得到待定位图像中每个像素点的预测地标属性之后,即可得到第一地标预测图像。
本公开的一些实施例中,若像素点的第一特征表示与各个地标点的地标特征表示之间的相似度均较低(如,均低于一个相似度阈值),则可以认为该像素点为与预设场景无关的无效像素点(如,天空、地面等),在此情况下,可以采用一个特殊标记(如,0)来进行标识。
步骤S62:对第一地标预测图像和第一方向预测图像进行分析,得到目标地标点。
在一个实施场景中,可以获取具有相同预测地标属性的像素点所构成的候选区域,即可以通过像素点的预测地标属性,将对应于相同地标点的像素点所构成的图像区域,作为一个候选区域。在此基础上,可以统计候选区域中像素点的第一方向属性的一致性情况,也就是说,对于每一候选区域,可以统计该候选区域中像素点的第一方向属性的一致性情况,从而可以得到各个候选区域的一致性情况。故此,可以在一致性情况满足预设条件的情况下,将候选区域中像素点的预测地标属性所标识的地标点作为目标地标点,并基于候选区域中像素点的第一方向属性,得到目标地标点在待定位图像中的第一位置信息。上述方式,在基于候选区域中像素点的预测地标属性确定目标地标点之前,先对候选区域中像素点的第一方向属性的一致性情况进行检测,从而能够有利于确保候选区域中像素点的第一方向属性的一致性,提高后续所构建的点对的质量,进而能够有利于提高视觉定位的准确性和鲁棒性。
在一个实施场景中,为了提升视觉定位的准确性和鲁棒性,在统计候选区域中像素点的第一方向属性的一致性情况之前,还可以先检测候选区域的区域面积是否小于第二阈值,若候选区域的区域面积小于第二阈值,则可以过滤该候选区域。上述方式,能够有利于预先滤除不稳定区域(如,草丛、树木等随自然条件而极易发生形态变化的区域),有利于提高后续所构建的点对的质量,进而能够有利于提高视觉定位的准确性和鲁棒性。
在另一个实施场景中,如前所述,第一方向信息可以包括第一方向矢量,则对于每一候选区域,可以先获取该候选区域中像素点之间的第一方向矢量的交点,再统计交点的外点率,得到该候选区域的一致性情况。在此情况下,预设条件可以相应设置为外点率低于外点率阈值,即如前所述,地标检测模型所预测得到的第一方向矢量可能存在方向偏差,在此情况下,候选区域中各个像素点的第一方向矢量可能并不会准确相交于一点(即地标投影),则可以预先设置一个外点率阈值,并利用基于直线求交模型的RANSAC算法(即RANSAC with a vote intersection model,可以参阅其相关技术细节),计算外点率,若候选区域的外点率低于外点率阈值,则可以认为地标检测模型针对该候选区域所预测的方向一致性较好,反之,如候选区域的外点率不低于外点率阈值,则可以认为地标检测模型针对该候选区域的学习效果欠佳或者该候选区域本身存在较大噪声,为了防止后续影响视觉定位的准确性和鲁棒性,可以直接过滤该候选区域。
在又一个实施场景中,以候选区域对应于地标点j为例,地标点j在待定位图像中的初始位置信息
Figure PCTCN2021126039-appb-000001
可以由前述基于直线求交模型的RANSAC算法计算得到,这些初始位置信息可以通过类似于期望最大化(Expectation-Maximum,EM)迭代算法进行优化,以 得到地标点j在待定位图像中的第一位置信息,优化过程,可以参阅EM迭代算法的技术细节。本公开的一些实施例中,如前所述,在迭代优化过程中,若候选区域的的一致性情况欠佳,则可以直接舍弃该候选区域。
请结合参阅图7、图8、图9和图10,图7是利用尺度不变特征变换(Scale Invariant Feature Transform,SIFT)特征进行视觉定位一实施例的示意图,图8是利用地标点进行视觉定位一实施例的示意图,图9是第一地标预测图像一实施例的示意图,图10是第一方向预测图像一实施例的示意图。基于图9所示的第一地标预测图像,可以统计到图8右侧箭头在图9所指候选区域的区域面积过小,故可以过滤该不稳定的候选区域(从图8可以看出该候选区域对应于树木),并基于图10所示的第一方向预测图像,可以统计到图8左侧箭头在图10所指候选区域的一致性情况欠佳,故可以过滤该候选区域。在此基础上,可以基于过滤之后剩余的候选区域,得到目标地标点(如图8中X标记所示)。此外,关于图9所示的第一地标预测图像中不同灰度像素点的含义和图10所示的第一方向预测图像中不同灰度像素点的含义,可以参阅前述相关描述。与之不同的是,如图7所示,利用SIFT特征进行视觉定位,可以得到数量庞大的特征点(如图7中空心圆所示),且这些特征点中存在诸如对应于树木等不稳定区域的干扰点,从而一方面由于特征点数量过于庞大,导致后续视觉定位计算量陡增,另一方面由于特征点中极易存在干扰点,影像后续视觉定位的准确性和鲁棒性。
上述方案,通过利用地标检测模型处理待定位图像,得到第一地标预测图像和第一方向预测图像,第一地标预测图像包括待定位图像中像素点的预测地标属性,第一方向预测图像包括待定位图像中像素点的第一方向属性,预测地标属性用于标识像素点对应的地标点,第一方向属性包括指向地标投影的第一方向信息,地标投影表示像素点对应的地标点在待定位图像中的投影位置。在此基础上,再对第一地标预测图像和第一方向预测图像进行分析,得到目标地标点,由于第一地标预测图像包括各个像素点所对应的地标点,而第一方向预测图像包括各个像素点指向地标投影的方向信息,故能够大大降低动态环境影响,提高定位鲁棒性。
请参阅图11,图11是训练地标检测模型一实施例的流程示意图。可以包括如下步骤:
步骤S111:分别确定子区域和地标点在样本图像的投影区域和投影位置。
本公开实施例中,子区域和地标点的含义可以参阅前述公开实施例中相关描述。
在一个实施场景中,样本图像是以样本位姿C对预设场景进行拍摄得到的。对于场景地图各个子区域而言,可以通过前述样本位姿C以及相机内参K投影到样本图像,以得到子区域在样本图像中的投影区域;类似地,对于各个地标点而言,也可以利用前述样本位姿C以及相机内参K投影到样本图像,以得到地标点在样本图像中的投影位置。以地标点投影为例,对于若干地标点{q 1,q 2,…,q n}中的地标点q j而言,可以通过下面公式(1)得到其在样本图像中的投影位置l j
l j=f(q j,K,C)   公式(1);
上述公式(1)中,f表示投影函数,其可以参阅世界坐标系、相机坐标系、图像坐标系以及像素坐标系之间的转换过程。
步骤S112:基于投影区域和投影位置,确定样本图像中样本像素点的样本地标属性和样本方向属性。
本公开实施例中,样本地标属性用于标识样本像素点对应的样本地标点,且样本地标点为投影区域覆盖样本像素点的子区域所含的地标点,样本方向属性包括指向样本像素点对应的样本地标点的投影位置的样本方向信息。
对于样本地标属性,为了便于描述,以样本图像中像素点i为例,其在样本图像中位置坐标可以记为p i=(u i,v i),像素点i被投影区域j覆盖,投影区域j是场景地图中子 区域j在样本图像中的投影区域,且子区域j中包含地标点q j,则像素点i的样本地标属性标识该地标点q j,如像素点i的样本地标属性可以包括地标点q j在若干地标点{q 1,q 2,…,q n}中的地标点标签j。其他情况可以以此类推,在此不再一一举例。此外,若样本图像中某一像素点并未被投影区域覆盖,则可以认为该像素点对应于天空或某些远距离物体,在此情况下,该像素点的样本地标属性采用特殊标记来进行标识,如可以采用与若干地标点{q 1,q 2,…,q n}的地标点标签无关的特殊标记(如,0)来进行标识,以此可以表示该像素点对于视觉定位并无作用。
对于样本方向属性,其所包含的样本方向信息可以为一个指向样本地标点的投影位置的样本方向矢量。此外,该样本方向矢量可以为一个单位矢量。为了便于描述,仍以样本图像中像素点i为例,如前所述,像素点i对应的样本地标点为地标点q j,且地标点q j在样本图像中投影位置可以通过上述公式(1)计算得到(即l j),则上述单位矢量d i可以表示为:
d i=(l j-p i)/||l j-p i|| 2   公式(2);
步骤S113:分别基于样本地标属性和样本方向属性,得到样本图像的样本地标图像和样本方向图像。
在一个实施场景中,样本地标图像和样本方向图像两者的尺寸可以与样本图像尺寸相同,即样本地标图像中第一像素点标注有对应的样本像素点的样本地标属性,样本方向图像中第二像素点标注有对应的样本像素点的样本方向属性。也就是说,样本地标图像中第i行第j列第一像素点标注有样本图像中第i行第j列样本像素点的样本地标属性,而样本方向图像中第i行第j列第二像素点标注有样本图像中第i行第j列样本像素点的样本方向属性。此外,在样本地标属性包括地标点标签的情况下,样本地标图像可以记为S∈□ H×W,即样本地标图像S的分辨率为H*W,且其中每一像素值均为整数;类似地,在样本方向属性以样本方向矢量表示的情况下,样本方向图像可以记为d∈□ H×W×2,即样本方向图像d的分辨率为H*W,且通道数为2,且通道图像中每一像素值均为实数,其中一个通道图像中像素值表示样本方向矢量的一个元素,另一个通道图像中像素值表示样本方向矢量的另一个元素。
步骤S114:利用样本图像、样本地标图像和样本方向图像训练地标检测模型。
本公开的一些实施例中,可以利用地标检测模型对样本图像进行预测,得到样本图像的第二特征预测图像和第二方向预测图像,且第二特征预测图像包括样本像素点的第二特征表示,第二方向预测图像包括样本像素点的第二方向属性,第二方向属性包括指向样本地标投影的第二方向信息,样本地标投影表示样本地标点在样本图像中的投影位置。在此基础上,可以基于样本地标图像和第二特征预测图像,得到第一损失,并利用样本方向图像和第二方向预测图像之间的差异,得到第二损失,以基于第一损失和第二损失,优化地标检测模型的网络参数。故此,通过预先构建的样本地标图像和样本方向图像监督地标检测模型的训练,有利于提升地标检测模型的检测性能。
在一个实施场景中,与第一方向信息类似地,第二方向信息可以包括第二方向矢量,该第二方向矢量指向样本地标投影。本公开的一些实施例中,在地标检测模型的检测性能极佳的情况下,地标检测模型所预测出来的第二方向矢量可能准确地指向样本地标投影,而在训练过程中,地标检测模型的性能是逐渐趋优的,且受限于各种因素,地标检测模型的检测性能也可能无法达到理想状态(即100%的准确率),在此情况下,地标检测模型所预测出来的第二方向矢量可能并非准确指向样本地标投影,如第二方向矢量所指向的位置与样本地标投影之间可以存在一定的角度偏差(如,1度、2度、3度等)。
In an implementation scenario, as mentioned above, a landmark feature representation set P may be maintained and updated during the training of the landmark detection model; the set P contains the feature representation to be optimized of each landmark point (e.g., the aforementioned {q_1, q_2, …, q_n}). In some embodiments of the present disclosure, at the first training iteration, the feature representations to be optimized of the landmark points in the set P may be obtained by random initialization. In addition, for ease of description, the second feature prediction image may be denoted E, so the second feature representation of pixel i in the sample image may be denoted E_i. To reduce the computational load and resource consumption of calculating the first loss, the image regions formed by sample pixels having the same sample landmark attribute can be obtained; then, for a sample pixel i in such an image region, the feature representation to be optimized of the sample landmark point identified by the sample landmark attribute can be taken as the positive feature representation P_i+ of sample pixel i, and one reference feature representation can be selected as the negative feature representation P_i− of sample pixel i, where the reference feature representations include the feature representations to be optimized other than the positive feature representation; that is, a feature representation to be optimized other than the positive feature representation can be selected from the landmark feature representation set P as the reference feature representation. On this basis, a sub-loss can be obtained based on the first similarity between the second feature representation E_i of sample pixel i and the positive feature representation P_i+ and the second similarity between the second feature representation E_i and the negative feature representation P_i−, and the first loss can be obtained based on the sub-losses of the sample pixels in the sample image; for example, the sub-losses of the pixels in the sample image can be summed to obtain the first loss. In the above manner, on the one hand, by minimizing the first loss, the second feature representation can be made as close as possible to its positive feature representation and as distant as possible from its negative feature representation, improving the prediction performance of the landmark prediction network; on the other hand, by selecting a single reference feature representation as the negative feature representation, computing the loss of the second feature representation against all negative classes is avoided, which greatly reduces computation and hardware consumption.
In an implementation scenario, the above first similarity and second similarity can be processed with a triplet loss function to obtain the sub-losses, and the sub-losses of the sample pixels in the sample image can be summed to obtain the first loss:

L_1 = Σ_i max(0, m − sim(E_i, P_i+) + sim(E_i, P_i−))   Formula (3);

In the above formula (3), m denotes the metric margin of the triplet loss and sim denotes the cosine similarity function; in some embodiments of the present disclosure, sim(x, y) = xᵀy / (||x||_2 · ||y||_2).
In another implementation scenario, before calculating the above first similarity and second similarity, the second feature representation of each sample pixel may first be L2-normalized. On this basis, the first similarity between the normalized second feature representation and the positive feature representation, and the second similarity between the normalized second feature representation and the negative feature representation, can be calculated.
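A compact Python sketch of this first loss, under our own shape conventions (E is an H x W x D embedding map, S the H x W long-typed label image with label 0 ignored, P the (n+1) x D set of feature representations to be optimized, and neg_idx a map of pre-sampled negative labels); the hinge form follows our reconstruction of formula (3) above and is an assumption, not a verbatim detail of the disclosure.

```python
import torch
import torch.nn.functional as F

def first_loss(E, S, P, neg_idx, m=0.5):
    # Triplet-style loss of formula (3) with one sampled negative per pixel.
    valid = S > 0                                   # skip label-0 pixels
    e = F.normalize(E[valid], dim=-1)               # L2-normalized E_i
    pos = F.normalize(P[S[valid]], dim=-1)          # positive P_{i+}
    neg = F.normalize(P[neg_idx[valid]], dim=-1)    # negative P_{i-}
    sim_pos = (e * pos).sum(-1)                     # cosine similarities
    sim_neg = (e * neg).sum(-1)
    return F.relu(m - sim_pos + sim_neg).sum()      # sum of sub-losses
```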
In yet another implementation scenario, please refer to Fig. 12, which is a schematic diagram of an embodiment of calculating the first loss. As shown by the dashed partition in Fig. 12, the sample image contains 4 image regions each formed by sample pixels having the same sample landmark attribute. Taking the lower-right image region as an example, the sample landmark point corresponding to the sample pixels in that region is landmark point i+; the average feature representation of the second feature representations of the sample pixels in the region can then be computed, e.g., by averaging the second feature representations of the sample pixels in the region to obtain the average feature representation M_i+. Afterwards, based on the similarities between the average feature representation M_i+ and each of the reference feature representations, several reference feature representations can be selected as the candidate feature representations of the image region; for example, the reference feature representations whose similarities rank within the preset top positions (e.g., top k), ordered from high to low, can be selected as the candidate feature representations of the image region (the three feature representations to be optimized indicated by the curved arrows in Fig. 12). On this basis, when obtaining the negative feature representation of each sample pixel in the image region, uniform sampling can be performed among the candidate feature representations. That is, since the sample pixels in the same image region are spatially close to each other and should have similar feature representations, they can also share similar negative feature representations; therefore, for each image region it suffices to mine a representative set of negative feature representations, and each sample pixel in the region only needs to sample from these representative negatives. For example, for sample pixel 1, sample pixel 2, sample pixel 3 and sample pixel 4 in the region, uniform sampling can be performed from the aforementioned three feature representations to be optimized to obtain the corresponding negative feature representations; e.g., the feature representations to be optimized indicated by the bold arrows can be taken as the respective negative feature representations. Other image regions can be deduced by analogy and are not enumerated here. This manner helps, on the one hand, improve the reference value of the reference feature representations and, on the other hand, reduce the complexity of selecting a negative feature representation for each sample pixel in the image region.
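The region-level negative mining just described could be sketched as follows; the top-k size and the uniform per-pixel sampling follow the text, while the tensor layout mirrors the previous sketch and is our assumption.

```python
def mine_negatives(E, S, P, k=3, seed=0):
    # For each image region with label j: average its embeddings (M_{i+}),
    # rank the other landmark representations by similarity to the mean,
    # keep the top-k as candidate negatives, then sample one uniformly
    # for every pixel of the region.
    gen = torch.Generator().manual_seed(seed)
    neg_idx = torch.zeros_like(S, dtype=torch.long)
    Pn = F.normalize(P, dim=-1)
    for j in S.unique():
        if j.item() == 0:
            continue                                # background, no negative
        mask = S == j
        mean = F.normalize(E[mask].mean(0), dim=-1) # average feature M_{i+}
        sim = Pn @ mean
        sim[j] = float("-inf")                      # exclude the positive
        sim[0] = float("-inf")                      # exclude the 0 slot
        cand = sim.topk(k).indices                  # candidate negatives
        pick = torch.randint(k, (int(mask.sum()),), generator=gen)
        neg_idx[mask] = cand[pick]
    return neg_idx
```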
In an implementation scenario, as mentioned above, the second direction attribute includes second direction information pointing at the sample landmark projection; for example, the second direction information may include a second direction vector pointing at the sample landmark projection. For ease of description, the second direction vector predicted for sample pixel i may be denoted d̂_i and the sample direction vector marked for sample pixel i may be denoted d_i; the second loss can then be expressed as:

L_2 = Σ_i 1(S_i ≠ 0) · ℓ(d̂_i − d_i)   Formula (4);

In the above formula (4), 1 denotes the indicator function, ℓ denotes a per-pixel penalty on the direction deviation (e.g., an L1-type penalty), and S_i ≠ 0 selects the sample pixels i that are identified with a corresponding sample landmark point in the sample landmark image S (i.e., it excludes the sample pixels representing the sky or distant objects, which are marked with a special mark such as 0).
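A sketch of this second loss, with the indicator realized as a boolean mask; whether the per-pixel penalty ℓ is plain L1 or smooth-L1 is not recoverable from the text, so the choice below is an assumption.

```python
def second_loss(d_pred, d_gt, S):
    # Formula (4): penalize predicted vs. ground-truth direction vectors
    # only at pixels whose label S_i != 0.
    valid = S > 0
    return F.smooth_l1_loss(d_pred[valid], d_gt[valid], reduction="sum")
```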
In an implementation scenario, after the first loss and the second loss are obtained, they can be summed with weights to obtain the total loss:

L = L_1 + λ · L_2   Formula (5);

In the above formula (5), λ denotes the weighting factor. On this basis, the network parameters of the landmark detection model and the feature representations to be optimized can be optimized based on the total loss.
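Combining the pieces, one training step could look like the sketch below, where the model is assumed to output the second feature prediction image and the second direction prediction image, and P is assumed to be registered as a learnable parameter so that the total loss of formula (5) updates both the network parameters and the feature representations to be optimized.

```python
def training_step(model, optimizer, image, S, d_gt, P, lam=1.0):
    E, d_pred = model(image)                       # assumed model outputs
    neg_idx = mine_negatives(E.detach(), S, P.detach())
    loss = first_loss(E, S, P, neg_idx) + lam * second_loss(d_pred, d_gt, S)
    optimizer.zero_grad()
    loss.backward()                                # grads for model and P
    optimizer.step()
    return loss.item()
```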
In the above solution, the projection regions of the sub-areas and the projection positions of the landmark points in the sample image are first determined respectively; then, based on the projection regions and the projection positions, the sample landmark attributes and sample direction attributes of the sample pixels in the sample image are determined, where the sample landmark attribute is used to identify the sample landmark point corresponding to a sample pixel, the sample landmark point is the landmark point contained in the sub-area whose projection region covers the sample pixel, and the sample direction attribute includes sample direction information pointing at the projection position of the sample landmark point corresponding to the sample pixel. On this basis, the sample landmark image and the sample direction image of the sample image are obtained based on the sample landmark attributes and the sample direction attributes respectively, with each first pixel in the sample landmark image annotated with the sample landmark attribute of the corresponding sample pixel and each second pixel in the sample direction image annotated with the sample direction attribute of the corresponding sample pixel, so that training samples can be constructed accurately. The landmark detection model is then trained with the sample image, the sample landmark image and the sample direction image, which helps improve the detection performance of the landmark detection model.
Please refer to Fig. 13, which is a schematic framework diagram of an embodiment of the visual positioning apparatus 1300 of the present disclosure. The visual positioning apparatus 1300 includes an information acquisition module 1310, a landmark detection module 1320 and a pose determination module 1330, where:

the information acquisition module 1310 is configured to acquire an image to be positioned captured of a preset scene;

the landmark detection module 1320 is configured to perform landmark detection on the image to be positioned to obtain target landmark points in the image to be positioned; where a target landmark point is at least one of several landmark points of the preset scene, the several landmark points are selected from a scene map of the preset scene, the scene map is obtained by three-dimensional modeling of the preset scene, and the several landmark points are respectively located at preset positions of the sub-areas of the scene map;

the pose determination module 1330 is configured to obtain pose parameters of the image to be positioned based on first position information of the target landmark points in the image to be positioned and second position information of the target landmark points in the scene map.
In some disclosed embodiments, the several sub-areas are obtained by partitioning the surface of the scene map; and/or, the preset position includes the center position of the sub-area; and/or, the area difference between the sub-areas is lower than a first threshold.
In some disclosed embodiments, the landmark detection module 1320 includes: an image processing sub-module configured to process the image to be positioned with a landmark detection model to predict a first landmark prediction image and a first direction prediction image; and an image analysis sub-module configured to analyze the first landmark prediction image and the first direction prediction image to obtain the target landmark points; where the first landmark prediction image includes the predicted landmark attributes of the pixels in the image to be positioned, the first direction prediction image includes the first direction attributes of the pixels in the image to be positioned, the predicted landmark attribute is used to identify the landmark point corresponding to a pixel, the first direction attribute includes first direction information pointing at the landmark projection, and the landmark projection represents the projection position, in the image to be positioned, of the landmark point corresponding to the pixel.
In some disclosed embodiments, the image analysis sub-module includes: a candidate region acquisition unit configured to obtain candidate regions formed by pixels having the same predicted landmark attribute; a consistency statistics unit configured to compute the consistency of the first direction attributes of the pixels in a candidate region; and a landmark determination unit configured to, when the consistency satisfies a preset condition, take the landmark point identified by the predicted landmark attribute of the pixels in the candidate region as a target landmark point and obtain, based on the first direction attributes of the pixels in the candidate region, the first position information of the target landmark point in the image to be positioned.
In some disclosed embodiments, the image analysis sub-module includes: a candidate region filtering unit configured to filter out a candidate region when the region area of the candidate region is smaller than a second threshold.
In some disclosed embodiments, the first direction information includes a first direction vector; the consistency statistics unit is further configured to obtain the intersection points of the first direction vectors of the pixels in the candidate region and count the outlier rate of the intersection points to obtain the consistency.
In some disclosed embodiments, the landmark detection model includes a feature extraction network, a landmark prediction network and a direction prediction network; the image processing sub-module includes: a feature extraction unit configured to perform feature extraction on the image to be positioned with the feature extraction network to obtain a feature image; a landmark prediction unit configured to perform landmark prediction on the feature image with the landmark prediction network to obtain the first landmark prediction image; and a direction prediction unit configured to perform direction prediction on the feature image with the direction prediction network to obtain the first direction prediction image.
In some disclosed embodiments, the landmark prediction unit is further configured to: decode the feature image with the landmark prediction network to obtain a first feature prediction image, the first feature prediction image including the first feature representations of the pixels in the image to be positioned; obtain the predicted landmark attribute of a pixel based on the similarities between the first feature representation of the pixel and the landmark feature representations of the respective landmark points, where the landmark feature representations are obtained after the training of the landmark detection model converges; and obtain the first landmark prediction image based on the predicted landmark attributes of the pixels in the image to be positioned.
In some disclosed embodiments, the target landmark points are detected with a landmark detection model, and the visual positioning apparatus 1300 further includes: a projection acquisition module configured to determine the projection regions of the sub-areas and the projection positions of the landmark points in a sample image, respectively; an attribute determination module configured to determine, based on the projection regions and the projection positions, the sample landmark attributes and sample direction attributes of the sample pixels in the sample image, where the sample landmark attribute is used to identify the sample landmark point corresponding to a sample pixel, the sample landmark point is the landmark point contained in the sub-area whose projection region covers the sample pixel, and the sample direction attribute includes sample direction information pointing at the projection position of the sample landmark point corresponding to the sample pixel; a sample acquisition module configured to obtain the sample landmark image and the sample direction image of the sample image based on the sample landmark attributes and the sample direction attributes respectively, where each first pixel in the sample landmark image is annotated with the sample landmark attribute of the corresponding sample pixel and each second pixel in the sample direction image is annotated with the sample direction attribute of the corresponding sample pixel; and a model training module configured to train the landmark detection model with the sample image, the sample landmark image and the sample direction image.
In some disclosed embodiments, the model training module includes: an image prediction sub-module configured to predict on the sample image with the landmark detection model to obtain a second feature prediction image and a second direction prediction image of the sample image, where the second feature prediction image includes the second feature representations of the sample pixels, the second direction prediction image includes the second direction attributes of the sample pixels, the second direction attribute includes second direction information pointing at the sample landmark projection, and the sample landmark projection represents the projection position of the sample landmark point in the sample image; a loss calculation sub-module configured to obtain a first loss based on the sample landmark image and the second feature prediction image and obtain a second loss from the difference between the sample direction image and the second direction prediction image; and a parameter optimization sub-module configured to optimize the network parameters of the landmark detection model based on the first loss and the second loss.
In some disclosed embodiments, the loss calculation sub-module includes: an image region and feature representation acquisition unit configured to obtain the image regions formed by sample pixels having the same sample landmark attribute and to obtain the feature representation to be optimized of each landmark point; a sub-loss calculation unit configured to, for a sample pixel in an image region, take the feature representation to be optimized of the sample landmark point identified by the sample landmark attribute as the positive feature representation of the sample pixel, select one reference feature representation as the negative feature representation of the sample pixel, and obtain a sub-loss based on the first similarity between the second feature representation and the positive feature representation and the second similarity between the second feature representation and the negative feature representation, where the reference feature representations include the feature representations to be optimized other than the positive feature representation; and a loss statistics unit configured to obtain the first loss based on the sub-losses of the sample pixels in the sample image.
In some disclosed embodiments, the sub-loss calculation unit is further configured to: compute the average feature representation of the second feature representations of the sample pixels in the image region; select several reference feature representations as the candidate feature representations of the image region based on the similarities between the average feature representation and each of the reference feature representations; and sample uniformly among the candidate feature representations to obtain the negative feature representation of a sample pixel.
In some disclosed embodiments, the parameter optimization sub-module is further configured to optimize, based on the first loss and the second loss, the feature representation to be optimized of each landmark point and the network parameters of the landmark detection model.
Please refer to Fig. 14, which is a schematic framework diagram of an embodiment of the electronic device 140 of the present disclosure. The electronic device 140 includes a memory 141 and a processor 142 coupled to each other; the processor 142 is configured to execute the program instructions stored in the memory 141 to implement any of the above visual positioning methods. In an implementation scenario, the electronic device 140 may include, but is not limited to, a microcomputer or a server; in addition, the electronic device 140 may also include mobile devices such as a notebook computer or a tablet computer, which is not limited here.
In some embodiments of the present disclosure, the processor 142 is configured to control itself and the memory 141 to implement the steps of any of the above visual positioning method embodiments. The processor 142 may also be called a Central Processing Unit (CPU). The processor 142 may be an integrated circuit chip with signal processing capability. The processor 142 may also be a general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. The general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. In addition, the processor 142 may be jointly implemented by integrated circuit chips.
The above solution can improve the accuracy and robustness of visual positioning.
Please refer to Fig. 15, which is a schematic framework diagram of an embodiment of the computer-readable storage medium 150 of the present disclosure. The computer-readable storage medium 150 stores program instructions 151 executable by a processor, the program instructions 151 being used to implement the steps of any of the above visual positioning method embodiments.

The above solution can improve the accuracy and robustness of visual positioning.
The disclosed embodiments further provide a computer program including computer-readable code; when the computer-readable code runs in an electronic device, a processor of the electronic device executes the visual positioning method described in any of the above embodiments.
In the several embodiments provided in the present disclosure, it should be understood that the disclosed method and apparatus may be implemented in other manners. For example, the apparatus implementations described above are merely illustrative; e.g., the division into modules or units is merely a division by logical function, and there may be other divisions in actual implementation; for example, units or components may be combined or integrated into another system, or some features may be ignored or not executed. Furthermore, the mutual couplings or direct couplings or communication connections shown or discussed may be indirect couplings or communication connections through some interfaces, apparatuses or units, and may be electrical, mechanical or in other forms.
Units described as separate components may or may not be physically separated, and components shown as units may or may not be physical units; that is, they may be located in one place or distributed over network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solutions of the present implementations.
In addition, the functional units in the embodiments of the present disclosure may be integrated in one processing unit, or each unit may exist physically alone, or two or more units may be integrated in one unit. The above integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on such an understanding, the technical solution of the present disclosure, in essence, or the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product; the computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) or a processor to execute all or some of the steps of the methods of the implementations of the present disclosure. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disc.
Industrial Applicability
The embodiments of the present application disclose a visual positioning method, apparatus, device, medium and program, where the visual positioning method includes: acquiring an image to be positioned captured of a preset scene; performing landmark detection on the image to be positioned to obtain target landmark points in the image to be positioned, where a target landmark point is at least one of several landmark points of the preset scene, the several landmark points are selected from a scene map of the preset scene, the scene map is obtained by three-dimensional modeling of the preset scene, and the several landmark points are respectively located at preset positions of the sub-areas of the scene map; and obtaining pose parameters of the image to be positioned based on first position information of the target landmark points in the image to be positioned and second position information of the target landmark points in the scene map.

Claims (17)

  1. A visual positioning method, the method being executed by an electronic device, the method comprising:
    acquiring an image to be positioned captured of a preset scene;
    performing landmark detection on the image to be positioned to obtain target landmark points in the image to be positioned; wherein a target landmark point is at least one of several landmark points of the preset scene, the several landmark points are selected from a scene map of the preset scene, the scene map is obtained by three-dimensional modeling of the preset scene, and the several landmark points are respectively located at preset positions of sub-areas of the scene map;
    obtaining pose parameters of the image to be positioned based on first position information of the target landmark points in the image to be positioned and second position information of the target landmark points in the scene map.
  2. The method according to claim 1, wherein the several sub-areas are obtained by partitioning a surface of the scene map;
    and/or, the preset position comprises a center position of the sub-area;
    and/or, an area difference between the sub-areas is lower than a first threshold.
  3. The method according to claim 1 or 2, wherein the performing landmark detection on the image to be positioned to obtain the target landmark points in the image to be positioned comprises:
    processing the image to be positioned with a landmark detection model to predict a first landmark prediction image and a first direction prediction image;
    analyzing the first landmark prediction image and the first direction prediction image to obtain the target landmark points;
    wherein the first landmark prediction image comprises predicted landmark attributes of pixels in the image to be positioned, the first direction prediction image comprises first direction attributes of the pixels in the image to be positioned, the predicted landmark attribute is used to identify the landmark point corresponding to a pixel, the first direction attribute comprises first direction information pointing at a landmark projection, and the landmark projection represents the projection position, in the image to be positioned, of the landmark point corresponding to the pixel.
  4. The method according to claim 3, wherein the analyzing the first landmark prediction image and the first direction prediction image to obtain the target landmark points comprises:
    obtaining candidate regions formed by pixels having the same predicted landmark attribute;
    computing the consistency of the first direction attributes of the pixels in a candidate region;
    when the consistency satisfies a preset condition, taking the landmark point identified by the predicted landmark attribute of the pixels in the candidate region as the target landmark point, and obtaining, based on the first direction attributes of the pixels in the candidate region, the first position information of the target landmark point in the image to be positioned.
  5. The method according to claim 4, wherein, before the computing the consistency of the first direction attributes of the pixels in the candidate region, the method further comprises:
    filtering out the candidate region when the region area of the candidate region is smaller than a second threshold.
  6. The method according to claim 4 or 5, wherein the first direction information comprises a first direction vector, and the computing the consistency of the first direction attributes of the pixels in the candidate region comprises:
    obtaining intersection points of the first direction vectors of the pixels in the candidate region;
    counting the outlier rate of the intersection points to obtain the consistency.
  7. The method according to claim 3, wherein the landmark detection model comprises a feature extraction network, a landmark prediction network and a direction prediction network; the processing the image to be positioned with the landmark detection model to predict the first landmark prediction image and the first direction prediction image comprises:
    performing feature extraction on the image to be positioned with the feature extraction network to obtain a feature image;
    performing landmark prediction on the feature image with the landmark prediction network to obtain the first landmark prediction image; and
    performing direction prediction on the feature image with the direction prediction network to obtain the first direction prediction image.
  8. The method according to claim 7, wherein the performing landmark prediction on the feature image with the landmark prediction network to obtain the first landmark prediction image comprises:
    decoding the feature image with the landmark prediction network to obtain a first feature prediction image; wherein the first feature prediction image comprises first feature representations of the pixels in the image to be positioned;
    obtaining the predicted landmark attribute of a pixel based on the similarities between the first feature representation of the pixel and the landmark feature representations of the respective landmark points; wherein the landmark feature representations are obtained after training of the landmark detection model converges;
    obtaining the first landmark prediction image based on the predicted landmark attributes of the pixels in the image to be positioned.
  9. The method according to any one of claims 3 to 8, wherein the target landmark points are detected with a landmark detection model, and training steps of the landmark detection model comprise:
    determining the projection regions of the sub-areas and the projection positions of the landmark points in a sample image, respectively;
    determining, based on the projection regions and the projection positions, sample landmark attributes and sample direction attributes of sample pixels in the sample image; wherein the sample landmark attribute is used to identify the sample landmark point corresponding to a sample pixel, the sample landmark point is the landmark point contained in the sub-area whose projection region covers the sample pixel, and the sample direction attribute comprises sample direction information pointing at the projection position of the sample landmark point corresponding to the sample pixel;
    obtaining a sample landmark image and a sample direction image of the sample image based on the sample landmark attributes and the sample direction attributes, respectively; wherein each first pixel in the sample landmark image is annotated with the sample landmark attribute of the corresponding sample pixel, and each second pixel in the sample direction image is annotated with the sample direction attribute of the corresponding sample pixel;
    training the landmark detection model with the sample image, the sample landmark image and the sample direction image.
  10. The method according to claim 9, wherein the training the landmark detection model with the sample image, the sample landmark image and the sample direction image comprises:
    predicting on the sample image with the landmark detection model to obtain a second feature prediction image and a second direction prediction image of the sample image; wherein the second feature prediction image comprises second feature representations of the sample pixels, the second direction prediction image comprises second direction attributes of the sample pixels, the second direction attribute comprises second direction information pointing at a sample landmark projection, and the sample landmark projection represents the projection position of the sample landmark point in the sample image;
    obtaining a first loss based on the sample landmark image and the second feature prediction image, and obtaining a second loss from the difference between the sample direction image and the second direction prediction image;
    optimizing network parameters of the landmark detection model based on the first loss and the second loss.
  11. The method according to claim 10, wherein the obtaining the first loss based on the sample landmark image and the second feature prediction image comprises:
    obtaining image regions formed by sample pixels having the same sample landmark attribute, and obtaining a feature representation to be optimized of each landmark point;
    for a sample pixel in an image region, taking the feature representation to be optimized of the sample landmark point identified by the sample landmark attribute as a positive feature representation of the sample pixel, selecting one reference feature representation as a negative feature representation of the sample pixel, and obtaining a sub-loss based on a first similarity between the second feature representation and the positive feature representation and a second similarity between the second feature representation and the negative feature representation; wherein the reference feature representations comprise the feature representations to be optimized other than the positive feature representation;
    obtaining the first loss based on the sub-losses of the sample pixels in the sample image.
  12. The method according to claim 11, wherein the selecting one reference feature representation as the negative feature representation of the sample pixel comprises:
    computing an average feature representation of the second feature representations of the sample pixels in the image region;
    selecting several reference feature representations as candidate feature representations of the image region based on the similarities between the average feature representation and each of the reference feature representations;
    sampling uniformly among the candidate feature representations to obtain the negative feature representation of the sample pixel.
  13. The method according to claim 10, wherein the optimizing the network parameters of the landmark detection model based on the first loss and the second loss comprises:
    optimizing, based on the first loss and the second loss, the feature representation to be optimized of each landmark point and the network parameters of the landmark detection model.
  14. A visual positioning apparatus, comprising:
    an information acquisition module configured to acquire an image to be positioned captured of a preset scene;
    a landmark detection module configured to perform landmark detection on the image to be positioned to obtain target landmark points in the image to be positioned; wherein a target landmark point is at least one of several landmark points of the preset scene, the several landmark points are selected from a scene map of the preset scene, the scene map is obtained by three-dimensional modeling of the preset scene, and the several landmark points are respectively located at preset positions of sub-areas of the scene map;
    a pose determination module configured to obtain pose parameters of the image to be positioned based on first position information of the target landmark points in the image to be positioned and second position information of the target landmark points in the scene map.
  15. An electronic device, comprising a memory and a processor coupled to each other, the processor being configured to execute program instructions stored in the memory to implement the visual positioning method according to any one of claims 1 to 13.
  16. A computer-readable storage medium having program instructions stored thereon, wherein the program instructions, when executed by a processor, implement the visual positioning method according to any one of claims 1 to 13.
  17. A computer program, comprising computer-readable code, wherein, when the computer-readable code runs in an electronic device, a processor of the electronic device executes the visual positioning method according to any one of claims 1 to 13.
PCT/CN2021/126039 2021-05-24 2021-10-25 Visual positioning method, apparatus, device, medium and program WO2022247126A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110564566.7 2021-05-24
CN202110564566.7A CN113240656B (zh) 2021-05-24 2021-08-10 Visual positioning method and related apparatus and device

Publications (1)

Publication Number Publication Date
WO2022247126A1 true WO2022247126A1 (zh) 2022-12-01

Family

ID=77138467

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/126039 WO2022247126A1 (zh) 2021-05-24 2021-10-25 Visual positioning method, apparatus, device, medium and program

Country Status (3)

Country Link
CN (1) CN113240656B (zh)
TW (1) TW202247108A (zh)
WO (1) WO2022247126A1 (zh)


Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113240656B (zh) 2021-05-24 2023-04-07 浙江商汤科技开发有限公司 Visual positioning method and related apparatus and device


Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108229494B (zh) * 2017-06-16 2020-10-16 北京市商汤科技开发有限公司 Network training method, processing method, apparatus, storage medium and electronic device
CN109325967B (zh) * 2018-09-14 2023-04-07 腾讯科技(深圳)有限公司 Target tracking method, apparatus, medium and device
CN110032962B (zh) * 2019-04-03 2022-07-08 腾讯科技(深圳)有限公司 Object detection method and apparatus, network device and storage medium
CN110796135A (zh) * 2019-09-20 2020-02-14 平安科技(深圳)有限公司 Target positioning method and apparatus, computer device and computer storage medium
CN111862205B (zh) * 2019-12-18 2024-06-21 北京嘀嘀无限科技发展有限公司 Visual positioning method, apparatus, device and storage medium
CN112328715B (zh) * 2020-10-16 2022-06-03 浙江商汤科技开发有限公司 Visual positioning method, training method of related model, and related apparatus and device
CN112767538B (zh) * 2021-01-11 2024-06-07 浙江商汤科技开发有限公司 Three-dimensional reconstruction and related interaction and measurement methods, and related apparatus and device

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200364509A1 (en) * 2019-05-16 2020-11-19 Naver Corporation System and method for training a neural network for visual localization based upon learning objects-of-interest dense match regression
WO2021027692A1 (zh) * 2019-08-09 2021-02-18 华为技术有限公司 Method for constructing visual feature library, visual positioning method, apparatus and storage medium
CN112700468A (zh) * 2019-10-23 2021-04-23 浙江商汤科技开发有限公司 Pose determination method and apparatus, electronic device and storage medium
CN111046125A (zh) * 2019-12-16 2020-04-21 视辰信息科技(上海)有限公司 Visual positioning method, system and computer-readable storage medium
CN111415388A (zh) * 2020-03-17 2020-07-14 Oppo广东移动通信有限公司 Visual positioning method and terminal
CN112284394A (zh) * 2020-10-23 2021-01-29 北京三快在线科技有限公司 Map construction and visual positioning method and apparatus
CN113240656A (zh) 2021-05-24 2021-08-10 浙江商汤科技开发有限公司 Visual positioning method and related apparatus and device

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20220343107A1 (en) * 2021-04-22 2022-10-27 Procore Technologies, Inc. Drawing Matching Tool
US11841924B2 (en) * 2021-04-22 2023-12-12 Procore Technologies, Inc. Drawing matching tool

Also Published As

Publication number Publication date
CN113240656A (zh) 2021-08-10
TW202247108A (zh) 2022-12-01
CN113240656B (zh) 2023-04-07

Similar Documents

Publication Publication Date Title
JP7453470B2 (ja) Three-dimensional reconstruction and related interaction and measurement methods, and related apparatus and device
CN107330439B Method for determining the pose of an object in an image, client and server
WO2019218824A1 Movement track acquisition method and device, storage medium and terminal
US11928800B2 Image coordinate system transformation method and apparatus, device, and storage medium
CN111046125A Visual positioning method, system and computer-readable storage medium
CN110568447A Visual positioning method, apparatus and computer-readable medium
CN110363817B Target pose estimation method, electronic device and medium
CN109919971B Image processing method, apparatus, electronic device and computer-readable storage medium
WO2022247126A1 Visual positioning method, apparatus, device, medium and program
Han et al. CAD-based 3D objects recognition in monocular images for mobile augmented reality
WO2021136386A1 Data processing method, terminal and server
CN108510520B Image processing method and apparatus, and AR device
CN110222572A Tracking method and apparatus, electronic device and storage medium
Li et al. RGBD relocalisation using pairwise geometry and concise key point sets
CN112907569A Head image region segmentation method and apparatus, electronic device and storage medium
CN115937546A Image matching and three-dimensional image reconstruction methods and apparatus, electronic device and medium
CN112085534A Attention analysis method, system and storage medium
Li et al. Image-Based Indoor Localization Using Smartphone Camera
WO2022126921A1 Panoramic picture detection method and apparatus, terminal and storage medium
CN112215036B Cross-camera tracking method, apparatus, device and storage medium
Osuna-Coutiño et al. Structure extraction in urbanized aerial images from a single view using a CNN-based approach
CN110135474A Oblique aerial image matching method and system based on deep learning
CN113570535A Visual positioning method and related apparatus and device
Bajramovic et al. Global Uncertainty-based Selection of Relative Poses for Multi Camera Calibration.
Gupta et al. Reconnoitering the Essentials of Image and Video Processing: A Comprehensive Overview

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21942690

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21942690

Country of ref document: EP

Kind code of ref document: A1