CN115841650A - Visual positioning method, visual positioning device, electronic equipment and readable storage medium
Info

Publication number
CN115841650A
Authority
CN
China
Prior art keywords
target object
video data
data stream
track
camera
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202211548514.1A
Other languages
Chinese (zh)
Other versions
CN115841650B (en)
Inventor
唐舟进
陈娜
王浩
于晓菲
郭树盛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Digital City Research Center
Original Assignee
Beijing Digital City Research Center
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Digital City Research Center filed Critical Beijing Digital City Research Center
Priority to CN202211548514.1A
Publication of CN115841650A
Application granted
Publication of CN115841650B
Legal status: Active
Anticipated expiration

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Image Analysis (AREA)

Abstract

The invention provides a visual positioning method, a visual positioning device, electronic equipment and a readable storage medium, relating to the field of vision technologies. The method comprises the following steps: detecting a target object in a first video data stream to obtain a feature tag of the target object and a first track of the target object, wherein the first video data stream is a video data stream collected by a first camera; detecting the target object in a second video data stream according to the feature tag of the target object to obtain a second track of the target object, wherein the second video data stream is a video data stream collected by a second camera; and generating a continuous motion track of the target object in a target area according to the first track and the second track of the target object, wherein the target area comprises the acquisition area of the first camera and the acquisition area of the second camera. The method reduces the demand on computing power and maintains high visual positioning accuracy even when the performance of the visual positioning system is limited.

Description

Visual positioning method, visual positioning device, electronic equipment and readable storage medium
Technical Field
The present invention relates to the field of vision technologies, and in particular, to a visual positioning method, a visual positioning apparatus, an electronic device, and a readable storage medium.
Background
Visual positioning technology, which identifies pedestrians based on computer vision, has gradually become a research hotspot in the field of smart cities and plays an important role in maintaining social order and safety.
At present, video data acquired by a camera generally needs to be transmitted to the cloud, where it is analyzed by a visual positioning system deployed there. However, long-distance network transmission suffers from problems such as a large bandwidth requirement and high transmission delay, especially when the data volume is large and high-frequency mutual calling is needed. For example, when multiple cameras detect pedestrians at the same time and the pedestrians move back and forth between different cameras, the pedestrians' features need to be re-identified each time to confirm their positions under the different cameras before position tracking can be performed, so positioning failures caused by time delay easily occur. If the visual positioning system is instead deployed on an edge computing platform, the energy consumption constraints and low computing capacity of edge computing devices often make successful deployment difficult and the resulting performance insufficient.
Therefore, the prior art suffers from low visual positioning accuracy when the performance of the visual positioning system is poor.
Disclosure of Invention
The embodiment of the invention provides a visual positioning method, a visual positioning device, electronic equipment and a readable storage medium, which aim to solve the prior-art problem of low visual positioning accuracy when the performance of a visual positioning system is poor.
In a first aspect, an embodiment of the present invention provides a visual positioning method, where the method includes:
detecting a target object in a first video data stream to obtain a feature tag of the target object and a first track of the target object, wherein the first video data stream is a video data stream collected by a first camera;
detecting the target object in a second video data stream according to the feature tag of the target object to obtain a second track of the target object, wherein the second video data stream is a video data stream collected by a second camera;
and generating a continuous motion track of the target object in a target area according to the first track of the target object and the second track of the target object, wherein the target area comprises the acquisition area of the first camera and the acquisition area of the second camera.
In a second aspect, an embodiment of the present invention provides a visual positioning apparatus, including:
the first detection module is used for detecting a target object in a first video data stream to obtain a feature tag of the target object and a first track of the target object, wherein the first video data stream is a video data stream collected by a first camera;
the second detection module is used for detecting the target object in a second video data stream according to the feature tag of the target object to obtain a second track of the target object, wherein the second video data stream is a video data stream collected by a second camera;
and the generating module is used for generating a continuous motion track of the target object in a target area according to the first track of the target object and the second track of the target object, wherein the target area comprises an acquisition area of the first camera and an acquisition area of the second camera.
In a third aspect, an embodiment of the present invention provides an electronic device, including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of the first aspect.
In a fourth aspect, embodiments of the present invention provide a non-transitory computer readable storage medium having stored thereon computer instructions for causing a computer to perform the method according to the first aspect.
In the embodiment of the invention, within the target area, the first video data stream covering the first camera's acquisition area is obtained, and the target object in it is detected to obtain the target object's feature tag and first track. When the target object is detected in the second video data stream covering the second camera's acquisition area, the corresponding target object can be quickly detected according to its feature tag, and a second track associated with the first track is generated. The target object is thus associated across video streams, which improves the accuracy of detecting the target object in video data streams acquired by different cameras and the continuity of the target object's motion track in the target area, while avoiding repeated feature identification of the target object in different video data streams and reducing the demand on computing power.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. It is obvious that the drawings in the following description show only some embodiments of the present invention, and that those skilled in the art can derive other drawings from them without inventive effort.
FIG. 1 is a schematic flow chart of a visual positioning method according to an embodiment of the present invention;
FIG. 2 is a block diagram of a visual positioning system provided by an embodiment of the present invention;
fig. 3 is a schematic flowchart of determining a location of a target object according to an embodiment of the present invention;
fig. 4 is a schematic structural diagram of a visual positioning apparatus according to an embodiment of the present invention.
Detailed Description
The terms "first", "second" and the like in the description and claims of the present invention are used to distinguish between similar elements and not necessarily to describe a particular sequence or chronological order. It should be understood that terms so used are interchangeable under appropriate circumstances, so that embodiments of the invention can be practiced in sequences other than those illustrated or described herein. Moreover, the terms "first", "second", etc. designate a class and do not limit the number of elements; for example, a first element can be one element or more than one.
A plurality of cameras can be arranged in the target area so that the cameras jointly monitor and cover it. The target area can be an urban space such as a building, a street or a park, and at least one camera can be deployed in each scene. For example, in a building scene, 8 cameras can be deployed in a hall, with the 8 cameras jointly covering every part of the hall; each of 5 meeting rooms adjacent to the hall can be provided with 4 cameras, with each meeting room covered by its 4 corresponding cameras.
The plurality of cameras may include a first camera and a second camera, each of which may be a monocular camera. The area covered by the first camera and the area covered by the second camera can be different areas in the same environment; for example, both cameras can be arranged in the hall of a first building, with the first camera covering one part of the hall and the second camera covering another part. The covered areas may also be different areas in different environments; for example, the first camera may be deployed in the hall of a first building and the second camera in the hall of a second building adjacent to the first. It should be noted that the first camera and the second camera may also be deployed in other scenes with the same technical effect, which is not described again here.
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to fig. 1, fig. 1 is a schematic flow chart of a visual positioning method provided by an embodiment of the present invention, as shown in fig. 1, the method includes the following steps:
step 101, detecting a target object in a first video data stream to obtain a feature tag of the target object and a first track of the target object, wherein the first video data stream is a video data stream collected by a first camera;
in the task of continuously tracking and positioning the pedestrians in a multi-path video scene by using a monocular camera, a target object can be one or more pedestrians in a first video data stream acquired by a first camera, and the target object, namely the pedestrian, in the first video data stream can be detected by a target detection module arranged on an edge computing platform; the target re-identification module arranged on the edge computing platform can identify the characteristics of clothing, body type, hair style, posture and the like of the target object to obtain the characteristic tags of the target object, each target object has a corresponding characteristic tag, and different target objects can be distinguished through the characteristic tags; through a target tracking module arranged on the edge computing platform, the target object can be continuously tracked to obtain a first track of the target object.
Step 102, detecting the target object in a second video data stream according to the feature tag of the target object to obtain a second track of the target object, wherein the second video data stream is a video data stream collected by a second camera;
the second camera can be arranged in an area which cannot be covered by the first camera, in other words, the target area can be covered by the first camera and the second camera, so that the monitoring blind area is reduced. When the second video data stream collected by the second camera is detected, the feature tags of the target objects acquired in the target re-identification module can be read, the corresponding target objects are quickly detected according to the feature tags, and the target objects are continuously tracked to obtain a second track of the target objects, so that the target objects are associated across the video frames.
Step 103, generating a continuous motion track of the target object in a target area according to the first track of the target object and the second track of the target object, wherein the target area comprises an acquisition area of the first camera and an acquisition area of the second camera.
The first track can be the motion track of the target object in the acquisition area of the first camera, and the second track can be its motion track in the acquisition area of the second camera. Because the target object is identified in the second video data stream acquired by the second camera according to its corresponding feature tag, the target object is associated across video streams: the second track of the same target object is linked with the first track, which improves the continuity of the target object's motion track in the target area.
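A minimal sketch of this track fusion, assuming both per-camera tracks have already been mapped into the shared target coordinate system as (timestamp, x, y) points (the patent does not prescribe a concrete fusion routine):

```python
def merge_tracks(first_track, second_track):
    """Fuse one target's per-camera tracks (lists of (timestamp, x, y)
    points in the shared target coordinate system) into a single
    time-ordered continuous trajectory over the target area."""
    merged = sorted(first_track + second_track, key=lambda p: p[0])
    # Where the two cameras' acquisition areas overlap, both tracks may
    # report the same instant; keep only the first point per timestamp.
    fused = [merged[0]]
    for point in merged[1:]:
        if point[0] - fused[-1][0] > 1e-3:
            fused.append(point)
    return fused
```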
In the embodiment of the invention, within the target area, the first video data stream covering the first camera's acquisition area is obtained, and the target object in it is detected to obtain the target object's feature tag and first track. When the target object is detected in the second video data stream covering the second camera's acquisition area, the corresponding target object can be rapidly detected according to its feature tag and a second track associated with the first track is generated, realizing cross-stream association of the target object. This improves the accuracy of detecting the target object in video data streams acquired by different cameras and the continuity of its motion track in the target area, avoids repeated feature identification of the target object in different video data streams, and reduces the demand on computing power. Consequently, even when the performance of the visual positioning system is poor, high visual positioning accuracy can be achieved.
The visual positioning system can be deployed on an edge computing platform. Compared with deploying the visual positioning system in the cloud, deployment on the edge computing platform reduces the communication cost of data transmission and is better suited to continuous tracking and positioning tasks that require high-frequency mutual calling.
Optionally, the detecting a target object in the first video data stream to obtain a feature tag of the target object and a first track of the target object includes:
detecting a target object in a first video data stream to obtain a feature tag of the target object;
obtaining the position of the target object in each video frame of the first video data stream according to the feature tag of the target object;
and obtaining a first track of the target object according to the position of the target object in each video frame of the first video data stream.
In one embodiment, as shown in FIG. 2, a visual positioning system deployed on an edge computing platform may include a camera calibration module, a target detection module, a target tracking module, a target re-identification module, and a target positioning module.
Through the target detection module, a target object in the first video data stream and a target object in the second video data stream can be detected;
through the target re-identification module, features of the target object in the first video data stream such as clothing, body type, hair style and posture can be identified to obtain the target object's feature tag. Each target object has a corresponding feature tag, and different target objects can be distinguished by their feature tags; in other words, target objects with the same feature tag can be regarded as the same object, so target objects can be associated across video streams based on identical feature tags;
and through the target tracking module, the position of the target object in each video frame of the first video data stream is obtained according to the target object's corresponding feature tag, realizing continuous tracking of the target object in the acquisition area of the first camera, and the first track of the target object is obtained from its position in each video frame. The target tracking module may adopt a Deep SORT network.
When the target object moves into the acquisition area of the second camera, the target object detected by the target detection module in the second video data stream can be associated with the one in the first video data stream based on the feature tag acquired by the target re-identification module. The target tracking module then obtains the position of the target object in each video frame of the second video data stream, so the target object can be continuously tracked in the acquisition area of the second camera and the generated second track can be associated with the first track. The motion track of the target object is thus associated across video streams, improving the accuracy of detecting the target object in video data streams acquired by different cameras.
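The per-frame loop below sketches how these modules could fit together on one stream; `detect_fn` and `embed_fn` are hypothetical stand-ins for the Yolov5n detector and Osnet embedding model described later, and `match_feature_tag` is the matcher sketched above:

```python
import cv2

def track_stream(video_path, detect_fn, embed_fn, tag_gallery, tracks):
    """Detect pedestrians frame by frame, re-identify them against the
    shared feature-tag gallery, and append rectangular-frame midpoints to
    the per-tag track so tracks continue across cameras."""
    cap = cv2.VideoCapture(video_path)
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        t = cap.get(cv2.CAP_PROP_POS_MSEC) / 1000.0  # frame timestamp (s)
        for (x1, y1, x2, y2) in detect_fn(frame):    # int pixel box corners
            tag = match_feature_tag(embed_fn(frame[y1:y2, x1:x2]), tag_gallery)
            if tag is not None:
                # Midpoint of the rectangular frame; the target positioning
                # module refines this when the heel point is occluded.
                tracks.setdefault(tag, []).append(
                    (t, (x1 + x2) / 2, (y1 + y2) / 2))
    cap.release()
```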
The visual positioning system deployed on the edge computing platform can further comprise a function display module. Based on analysis of the target object's motion track, the function display module can generate and display traffic information in the target area, statistics of hot-spot areas/areas of interest over a period of time, dynamic-area traffic distribution heat maps, pedestrian social distance monitoring information, absolute x/y coordinates of pedestrians, pedestrian identity information, multi-camera pedestrian track maps, and the like.
Optionally, the obtaining a first track of the target object according to the position of the target object in each video frame of the first video data stream includes:
and mapping coordinates corresponding to the position of the target object in each video frame of the first video data stream to a target coordinate system to obtain a first track of the target object, wherein the target coordinate system is a coordinate system obtained by pre-calibrating the first camera and the second camera.
In one example, the camera calibration module can be used to obtain the intrinsic and extrinsic parameters and distortion coefficients of the cameras, so that any plane in a video frame can be mapped onto the target coordinate system plane through rotation and translation, realizing measurement of size and position on any plane in three-dimensional space. Zhang's calibration method can be adopted to obtain the intrinsic matrix, extrinsic matrix and distortion coefficients of the first camera and the second camera.
Specifically, after images of a checkerboard calibration board are obtained, the pixel coordinates (u, v) of each corner point can be extracted with a corresponding image detection algorithm. Zhang's calibration method fixes the world coordinate system, i.e., the target coordinate system, on the checkerboard, so that any point on the checkerboard has physical coordinate W = 0. Since the world coordinate system of the calibration board is defined in advance and the size of each grid on the board is known, the physical coordinates (U, V, W = 0) of each corner point in the world coordinate system can be calculated. With this information, the camera is calibrated using the pixel coordinates (u, v) and the physical coordinates (U, V, W = 0) of each corner point, yielding the camera's intrinsic and extrinsic parameter matrices and distortion parameters, with which image coordinates can be converted to world coordinates. Further, the coordinates of the target object's position in each video frame of the first video data stream can be mapped to the target coordinate system to obtain the first track, and the coordinates of its position in each video frame of the second video data stream can be mapped to the target coordinate system to obtain the second track.
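A minimal OpenCV sketch of this calibration step, assuming a 9x6 inner-corner checkerboard with 25 mm squares (the patent does not state the board geometry):

```python
import cv2
import numpy as np

def calibrate(images, board=(9, 6), square=0.025):
    """Zhang's method: detect the pixel coordinates (u, v) of the
    checkerboard corners, pair them with the known board-plane physical
    coordinates (U, V, W=0), and solve for the intrinsic matrix,
    distortion coefficients and per-view extrinsics. Assumes at least
    one image with a fully visible board."""
    objp = np.zeros((board[0] * board[1], 3), np.float32)
    objp[:, :2] = np.mgrid[0:board[0], 0:board[1]].T.reshape(-1, 2) * square
    obj_pts, img_pts = [], []
    for img in images:
        gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
        found, corners = cv2.findChessboardCorners(gray, board)
        if found:
            obj_pts.append(objp)
            img_pts.append(cv2.cornerSubPix(
                gray, corners, (11, 11), (-1, -1),
                (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 30, 1e-3)))
    rms, K, dist, rvecs, tvecs = cv2.calibrateCamera(
        obj_pts, img_pts, gray.shape[::-1], None, None)
    return K, dist, rvecs, tvecs
```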
In addition, obtaining the cameras' intrinsic and extrinsic parameters and distortion coefficients through calibration makes it possible to remove the distortion produced by the camera, so that images are corrected and the accuracy of detection results is improved.
Optionally, the obtaining, according to the feature tag of the target object, a position of the target object in each video frame of the first video data stream includes:
obtaining a rectangular frame where the target object is located in each video frame of the first video data stream according to the feature tag of the target object;
and determining the position of the target object according to the midpoint coordinate of the rectangular frame.
In an example, after the first video data stream is acquired, the target detection module can find the target object in each video frame of the first video data stream according to its corresponding feature tag and label it with a rectangular frame; likewise, the target object in each video frame of the second video data stream can be found and labeled with a rectangular frame. Rectangular frames corresponding to target objects with the same feature tag are then associated, so the rectangular frames of the same target object are linked across video streams.
When the target tracking module obtains the position of the target object in each video frame of a video data stream (including the first video data stream and the second video data stream), the position can be derived from the midpoint coordinate of the rectangular frame; in other words, the midpoint coordinates of the tracked rectangular frames form the motion track (including the first track and the second track), which improves the positioning and tracking accuracy of the target tracking module.
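Given the calibration above, mapping a rectangular-frame midpoint into the target coordinate system reduces to a perspective transform with the camera's ground-plane homography; a minimal sketch, where the homography `H` is assumed to have been derived from the intrinsic/extrinsic matrices:

```python
import cv2
import numpy as np

def image_point_to_ground(pt_uv, H):
    """Map a pixel coordinate (e.g. the rectangular-frame midpoint) onto
    the calibrated target coordinate system's ground plane via the 3x3
    homography H."""
    src = np.array([[pt_uv]], dtype=np.float32)      # shape (1, 1, 2)
    dst = cv2.perspectiveTransform(src, H)
    return float(dst[0, 0, 0]), float(dst[0, 0, 1])  # (X, Y) on the ground
```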
Optionally, the determining, according to the midpoint coordinate of the rectangular frame, the position where the target object is located includes:
determining the midpoint of the rectangular frame as the position of the target object under the condition that the rectangular frame where the target object is located contains the heel point feature of the target object;
under the condition that the rectangular frame where the target object is located does not contain the heel point feature of the target object, calculating a positioning point of the rectangular frame according to the midpoint of the rectangular frame, the relative height of the target object, the X-axis walking speed of the target object and the Y-axis walking speed of the target object;
and determining the positioning point of the rectangular frame as the position of the target object.
In an example, as shown in fig. 3, the camera calibration module obtains the intrinsic and extrinsic parameters and distortion coefficients of the cameras, so that any plane in a video frame can be mapped onto the target coordinate system plane through rotation and translation, realizing measurement of size and position on any plane in three-dimensional space. Therefore, when the rectangular frame of the target object does not contain the heel point A of the target object, in other words, when people are dense and occlusion is severe so that the heel point A cannot be identified, the target positioning module can calculate the positioning point of the rectangular frame from the midpoint of the rectangular frame, the relative height of the target object, and the target object's X-axis and Y-axis walking speeds. The positioning point of the rectangular frame corresponds to the midpoint coordinate the rectangular frame would have if the target object were not occluded.
Specifically, since the world coordinate system of the calibration board is defined in advance and the size of each grid on the board is known, a set of planes parallel to the ground can be constructed, and the relative height of the target object can be calculated from an unoccluded position B of the target object. The X-axis and Y-axis walking speeds of the target object can be calculated from the midpoint coordinate of the current rectangular frame and the differences between the target object's coordinates in different video frames over a unit time interval. Using the relative height and the X-axis and Y-axis walking speeds of the target object, the absolute position of the midpoint of the top edge of the tracking frame, i.e., the positioning point of the rectangular frame, is calculated, and this positioning point is determined as the position of the target object. This effectively solves the target occlusion problem; meanwhile, the algorithm is computationally simple and consumes almost no Central Processing Unit (CPU) resources.
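The patent gives no closed-form expression for the positioning point, so the following is only a hedged sketch of the two quantities it names: per-axis walking speed from coordinate differences over a time interval, and a dead-reckoned ground position while the heel point is occluded:

```python
def walking_speed(p_prev, p_curr, dt):
    """Per-axis walking speed from two ground positions (x, y) of the same
    target taken dt seconds apart."""
    return (p_curr[0] - p_prev[0]) / dt, (p_curr[1] - p_prev[1]) / dt

def locate_when_occluded(last_xy, vx, vy, dt):
    """Estimate the current ground position while the heel point is not
    visible by advancing the last unoccluded position with the estimated
    X/Y walking speeds; a stand-in for the patent's positioning-point
    computation from the frame midpoint and relative height."""
    return last_xy[0] + vx * dt, last_xy[1] + vy * dt
```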
When pedestrians are sparse and occlusion is not severe, in other words, when the rectangular frame of the target object contains the heel point A of the target object, the target positioning module can directly determine the midpoint of the rectangular frame as the position of the target object for tracking.
Optionally, the detecting a target object in the first video data stream includes:
and carrying out position detection on the target object in the first video data stream through a lightweight Yolov5n network model.
In one example, target detection may employ an improved lightweight Yolov5n detection method; that is, the position of the target object in the first video data stream is detected through the lightweight Yolov5n network model, which minimizes the loss of positioning accuracy while achieving a lightweight model. This reduces the performance requirements on the deployment equipment, so high visual positioning accuracy can be achieved even when the performance of the visual positioning system is poor.
Specifically, in training the Yolov5n network model, the model may be trained as a single detector on a self-built pedestrian data set, for example of 100,000 images, for detecting pedestrians in pictures. Secondly, a structural re-parameterization technique can be adopted to prune the Yolov5n network model and eliminate redundant parameters from the trained model, reducing model complexity; the pruning proportion can be 50%, after which the network is quantized to 8 bits. Finally, a performance test on the pruned model shows that the quantized Yolov5n network model loses almost no accuracy. This reduces the performance requirements on the edge device when deploying the Yolov5n network model, so high visual positioning accuracy can be achieved even when the performance of the visual positioning system is poor.
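A hedged PyTorch sketch of such a compression pass (the patent uses structural re-parameterization pruning; plain L1 magnitude pruning and dynamic int8 quantization are used here as generic stand-ins):

```python
import torch
import torch.nn.utils.prune as prune

def slim_detector(model):
    """Prune 50% of each convolution's weights by L1 magnitude, bake the
    sparsity in, then quantize to 8 bits. Dynamic quantization only covers
    the linear layers; static or quantization-aware training would be
    needed to run the convolutions themselves in int8."""
    for module in model.modules():
        if isinstance(module, torch.nn.Conv2d):
            prune.l1_unstructured(module, name="weight", amount=0.5)
            prune.remove(module, "weight")  # make the pruning permanent
    return torch.quantization.quantize_dynamic(
        model, {torch.nn.Linear}, dtype=torch.qint8)
```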
Optionally, the detecting a target object in the first video data stream includes:
and carrying out feature detection on the target object in the first video data stream through a lightweight Osnet network model.
In an example, during target identity recognition, feature detection can also be performed on the target object in the first video data stream through a lightweight Osnet network model, achieving pedestrian re-identification across cameras.
Specifically, an improved identification method based on a lightweight Osnet network may be employed. First, a target re-identification data set is self-built, for example of 200,000 images, the Osnet convolution and block structures are modified, and the Osnet network model is trained. Secondly, the modified network is pruned by 50% and quantized to 8 bits. Finally, network performance testing shows that the accuracy of the quantized Osnet network model drops by only 0.3% while the network complexity is greatly reduced. This reduces the performance requirements on the edge device when deploying the Osnet network model, so high visual positioning accuracy can be achieved even when the performance of the visual positioning system is poor.
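For illustration, appearance embeddings of this kind can be produced with the open-source torchreid package's pretrained OSNet weights; this is an assumption for the sketch only, since the patent trains its own modified Osnet on a self-built data set and then prunes and quantizes it:

```python
from torchreid.utils import FeatureExtractor

# Off-the-shelf OSNet from torchreid, standing in for the patent's
# modified, pruned and quantized Osnet model (an assumption).
extractor = FeatureExtractor(model_name="osnet_x0_25", device="cpu")

def feature_tags(crops):
    """Return L2-normalized appearance embeddings for a list of pedestrian
    crops (HxWx3 uint8 arrays); these play the role of feature tags that
    are matched across cameras."""
    feats = extractor(crops)  # torch tensor, one 512-dim row per crop
    return feats / feats.norm(dim=1, keepdim=True)
```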
Therefore, through system design and optimization of the network structure, the system accomplishes continuous tracking and positioning of pedestrians across multiple video feeds using monocular cameras, effectively handles occlusion, achieves identification and positioning/tracking precision nearly equal to mainstream schemes, greatly reduces computational cost, and can be deployed on edge computing devices.
Referring to fig. 4, fig. 4 is a schematic structural diagram of a visual positioning apparatus provided in an embodiment of the present invention, and the visual positioning apparatus 400 includes:
the first detection module 401 is configured to detect a target object in a first video data stream to obtain a feature tag of the target object and a first track of the target object, where the first video data stream is a video data stream acquired by a first camera;
a second detection module 402, configured to detect the target object in a second video data stream according to the feature tag of the target object to obtain a second track of the target object, where the second video data stream is a video data stream acquired by a second camera;
a generating module 403, configured to generate a continuous motion trajectory of the target object in a target area according to the first trajectory of the target object and the second trajectory of the target object, where the target area includes a collection area of the first camera and a collection area of the second camera.
Optionally, the first detection module 401 includes:
the first calculation submodule is used for detecting a target object in a first video data stream to obtain a feature tag of the target object;
the second calculation submodule is used for obtaining the position of the target object in each video frame of the first video data stream according to the feature tag of the target object;
and the third calculation submodule is used for obtaining a first track of the target object according to the position of the target object in each video frame of the first video data stream.
Optionally, the third computing submodule comprises:
a mapping unit, configured to map, in each video frame of the first video data stream, a coordinate corresponding to a position where the target object is located to a target coordinate system to obtain a first track of the target object, where the target coordinate system is a coordinate system obtained by calibrating the first camera and the second camera in advance.
Optionally, the second computation submodule includes:
the computing unit is used for obtaining a rectangular frame where the target object is located in each video frame of the first video data stream according to the feature tag of the target object;
and the determining unit is used for determining the position of the target object according to the midpoint coordinate of the rectangular frame.
Optionally, the determining unit includes:
a first determining subunit, configured to determine, when a rectangular frame in which the target object is located includes a heel point feature of the target object, a midpoint of the rectangular frame as a position where the target object is located;
the calculating subunit is configured to calculate, when a rectangular frame in which the target object is located does not include the heel point feature of the target object, a positioning point of the rectangular frame according to a midpoint of the rectangular frame, a relative height of the target object, an X-axis walking speed of the target object, and a Y-axis walking speed of the target object;
and the second determining subunit is used for determining the positioning point of the rectangular frame as the position of the target object.
Optionally, the first detection module 401 includes:
and the first detection submodule is used for carrying out position detection on the target object in the first video data stream through a lightweight Yolov5n network model.
Optionally, the first detection module 401 includes:
and the second detection submodule is used for carrying out feature detection on the target object in the first video data stream through a lightweight Osnet network model.
It should be noted that the visual positioning apparatus can implement each process in the method embodiment shown in fig. 1, and details are not described here to avoid repetition.
An embodiment of the present invention further provides an electronic device, including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor, so that the at least one processor can perform the processes of the above embodiment of the visual positioning method, and achieve the same technical effects, and further description is omitted here to avoid repetition.
The embodiment of the present invention further provides a non-transitory computer-readable storage medium storing computer instructions for causing a computer to execute each process of the above visual positioning method embodiment with the same technical effects, which are not repeated here to avoid repetition. The computer readable storage medium is, for example, a ROM, a RAM, a magnetic disk or an optical disk.
It should be noted that, in this document, the terms "comprises", "comprising", or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element introduced by the phrase "comprising a ..." does not exclude the presence of another identical element in the process, method, article, or apparatus that comprises it. Furthermore, it should be noted that the scope of the methods and apparatus of embodiments of the present invention is not limited to performing functions in the order discussed; functions may also be performed in a substantially simultaneous manner or in reverse order depending on the functionality involved. For example, the described methods may be performed in an order different from that described, and various steps may be added, omitted, or combined. In addition, features described with reference to certain examples may be combined in other examples.
Through the description of the foregoing embodiments, it is clear to those skilled in the art that the method of the foregoing embodiments may be implemented by software plus a necessary general hardware platform, and certainly may also be implemented by hardware, but in many cases, the former is a better implementation. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which is stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal (such as a mobile phone, a computer, a server, an air conditioner, or a network device) to execute the method according to the embodiments of the present invention.
While the present invention has been described with reference to the embodiments shown in the drawings, the present invention is not limited to the embodiments, which are illustrative and not restrictive, and it will be apparent to those skilled in the art that various changes and modifications can be made therein without departing from the spirit and scope of the invention as defined in the appended claims.

Claims (10)

1. A visual positioning method, characterized in that the method comprises:
detecting a target object in a first video data stream to obtain a feature tag of the target object and a first track of the target object, wherein the first video data stream is a video data stream collected by a first camera;
detecting the target object in a second video data stream according to the feature tag of the target object to obtain a second track of the target object, wherein the second video data stream is a video data stream collected by a second camera;
and generating a continuous motion track of the target object in a target area according to the first track of the target object and the second track of the target object, wherein the target area comprises the acquisition area of the first camera and the acquisition area of the second camera.
2. The method of claim 1, wherein the detecting the target object in the first video data stream to obtain the feature tag of the target object and the first track of the target object comprises:
detecting a target object in a first video data stream to obtain a feature tag of the target object;
obtaining the position of the target object in each video frame of the first video data stream according to the feature tag of the target object;
and obtaining a first track of the target object according to the position of the target object in each video frame of the first video data stream.
3. The method of claim 2, wherein obtaining the first track of the target object according to the position of the target object in each video frame of the first video data stream comprises:
and mapping coordinates corresponding to the position of the target object in each video frame of the first video data stream to a target coordinate system to obtain a first track of the target object, wherein the target coordinate system is a coordinate system obtained by pre-calibrating the first camera and the second camera.
4. The method according to claim 2, wherein the obtaining the position of the target object in each video frame of the first video data stream according to the feature tag of the target object comprises:
obtaining a rectangular frame where the target object is located in each video frame of the first video data stream according to the feature tag of the target object;
and determining the position of the target object according to the midpoint coordinate of the rectangular frame.
5. The method according to claim 4, wherein the determining the position of the target object according to the midpoint coordinate of the rectangular frame comprises:
determining the midpoint of the rectangular frame as the position of the target object under the condition that the rectangular frame where the target object is located contains the heel point feature of the target object;
under the condition that the rectangular frame where the target object is located does not contain the heel point feature of the target object, calculating a positioning point of the rectangular frame according to the midpoint of the rectangular frame, the relative height of the target object, the X-axis walking speed of the target object and the Y-axis walking speed of the target object;
and determining the positioning point of the rectangular frame as the position of the target object.
6. The method of claim 1, wherein detecting the target object in the first video data stream comprises:
and carrying out position detection on the target object in the first video data stream through a lightweight Yolov5n network model.
7. The method of claim 1, wherein detecting the target object in the first video data stream comprises:
and carrying out feature detection on the target object in the first video data stream through a lightweight Osnet network model.
8. A visual positioning device, comprising:
the first detection module is used for detecting a target object in a first video data stream to obtain a feature tag of the target object and a first track of the target object, wherein the first video data stream is a video data stream collected by a first camera;
the second detection module is used for detecting the target object in a second video data stream according to the feature tag of the target object to obtain a second track of the target object, wherein the second video data stream is a video data stream collected by a second camera;
and the generating module is used for generating a continuous motion track of the target object in a target area according to the first track of the target object and the second track of the target object, wherein the target area comprises an acquisition area of the first camera and an acquisition area of the second camera.
9. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1 to 7.
10. A non-transitory computer readable storage medium storing computer instructions for causing a computer to perform the method of any one of claims 1 to 7.
CN202211548514.1A 2022-12-05 2022-12-05 Visual positioning method, visual positioning device, electronic equipment and readable storage medium Active CN115841650B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211548514.1A CN115841650B (en) 2022-12-05 2022-12-05 Visual positioning method, visual positioning device, electronic equipment and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211548514.1A CN115841650B (en) 2022-12-05 2022-12-05 Visual positioning method, visual positioning device, electronic equipment and readable storage medium

Publications (2)

Publication Number Publication Date
CN115841650A (en) 2023-03-24
CN115841650B CN115841650B (en) 2023-08-01

Family

ID=85578036

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211548514.1A Active CN115841650B (en) 2022-12-05 2022-12-05 Visual positioning method, visual positioning device, electronic equipment and readable storage medium

Country Status (1)

Country Link
CN (1) CN115841650B (en)

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170327138A1 (en) * 2016-05-16 2017-11-16 Wi-Tronix, Llc Video Content Analysis System and Method for Transportation System
CN109919981A (en) * 2019-03-11 2019-06-21 南京邮电大学 A kind of multi-object tracking method of the multiple features fusion based on Kalman filtering auxiliary
CN110276783A (en) * 2019-04-23 2019-09-24 上海高重信息科技有限公司 A kind of multi-object tracking method, device and computer system
CN110378931A (en) * 2019-07-10 2019-10-25 成都数之联科技有限公司 A kind of pedestrian target motion track acquisition methods and system based on multi-cam
CN111815675A (en) * 2020-06-30 2020-10-23 北京市商汤科技开发有限公司 Target object tracking method and device, electronic equipment and storage medium
CN112102372A (en) * 2020-09-16 2020-12-18 上海麦图信息科技有限公司 Cross-camera track tracking system for airport ground object
WO2022084390A1 (en) * 2020-10-21 2022-04-28 Nomitri Gmbh Embedded device based detection system
CN112465855A (en) * 2021-02-02 2021-03-09 南京甄视智能科技有限公司 Passenger flow statistical method, device, storage medium and equipment
CN113658192A (en) * 2021-07-08 2021-11-16 华南理工大学 Multi-target pedestrian track acquisition method, system, device and medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Liu Yating et al.: "Visual Object Tracking Based on Tracklet Association: Status and Prospects", Acta Automatica Sinica, vol. 43, no. 11, pages 1869-1885 *

Also Published As

Publication number Publication date
CN115841650B (en) 2023-08-01

Similar Documents

Publication Publication Date Title
WO2021196294A1 (en) Cross-video person location tracking method and system, and device
Sidla et al. Pedestrian detection and tracking for counting applications in crowded situations
CN108898676B (en) Method and system for detecting collision and shielding between virtual and real objects
CN110570454B (en) Method and device for detecting foreign matter invasion
CN108628306B (en) Robot walking obstacle detection method and device, computer equipment and storage medium
JP7292492B2 (en) Object tracking method and device, storage medium and computer program
CN110599522B (en) Method for detecting and removing dynamic target in video sequence
CN103839277A (en) Mobile augmented reality registration method of outdoor wide-range natural scene
CN113989450A (en) Image processing method, image processing apparatus, electronic device, and medium
CN111160243A (en) Passenger flow volume statistical method and related product
CN111259957A (en) Visibility monitoring and model training method, device, terminal and medium based on deep learning
CN111445531A (en) Multi-view camera navigation method, device, equipment and storage medium
CN110634138A (en) Bridge deformation monitoring method, device and equipment based on visual perception
CN113344852A (en) Target detection method and device for power scene general-purpose article and storage medium
CN110991297A (en) Target positioning method and system based on scene monitoring
CN112528858A (en) Training method, device, equipment, medium and product of human body posture estimation model
CN111738225A (en) Crowd gathering detection method, device, equipment and storage medium
CN112183431A (en) Real-time pedestrian number statistical method and device, camera and server
CN115376034A (en) Motion video acquisition and editing method and device based on human body three-dimensional posture space-time correlation action recognition
Chaochuan et al. An extrinsic calibration method for multiple RGB-D cameras in a limited field of view
CN111899279A (en) Method and device for detecting motion speed of target object
CN113610835B (en) Human shape detection method for nursing camera
CN113569912A (en) Vehicle identification method and device, electronic equipment and storage medium
CN108234932B (en) Method and device for extracting personnel form in video monitoring image
CN110636248A (en) Target tracking method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant