CN112132864A

CN112132864A - Robot following method based on vision and following robot

Info

Publication number: CN112132864A
Application number: CN202010993247.3A
Authority: CN
Inventors: 于峰
Original assignee: Dalian Aoyou Intelligent Technology Co ltd
Current assignee: Dalian Aoyou Intelligent Technology Co ltd
Priority date: 2020-09-21
Filing date: 2020-09-21
Publication date: 2020-12-25
Anticipated expiration: 2040-09-21
Also published as: CN112132864B

Abstract

The invention provides a robot following method based on vision and a following robot, and relates to the technical field of computer vision. A vision-based robot following method comprising the steps of: receiving target object information; the method comprises the steps that a vision camera acquires a visual field input image, the visual field input image is preprocessed to generate a detection input image, and the detection input image is input into a pedestrian detection neural network model for detection; the detection input image is formed by splicing a plurality of images which have different resolutions and are related to the view field input image; acquiring a pedestrian detection result, and determining a following target in the pedestrian according to the target object information; and carrying out robot following on the following target. The invention reduces the requirement of target tracking on the calculation power of equipment and reduces the power consumption on the basis of ensuring the real-time accuracy of far and near target tracking.

Description

Robot following method based on vision and following robot

Technical Field

The invention relates to the technical field of computer vision.

Background

The intelligent mobile robot for following the moving target is widely applied to the fields of home service, old and disabled assisting, scene monitoring, intelligent vehicles and the like, and has wide application prospects. The following of a target object by a mobile robot relates to the fields of computer vision, motion control, pattern recognition and the like. For the robot vision, the aim is to simulate a human vision mechanism, calculate the importance degree of information in a visual scene, and extract interesting salient features or target object features in an image.

The following process of a vision-based following robot typically includes image acquisition, target detection, and target tracking. With the rapid development of artificial intelligence and deep learning techniques, a target detection method based on a Convolutional Neural Network (CNN) algorithm is widely applied. Compared with the traditional machine vision method, the convolutional neural network learns useful characteristics from a large amount of data under the training of big data, has the advantages of high speed, high precision, low cost and the like, and the convolutional neural network algorithm is also applied to pedestrian detection following the robot, for example, Chinese patent application CN2020101552071 discloses a pedestrian autonomous following method based on the vision for the quadruped robot, and a pedestrian detection model of the method is based on the convolutional neural network algorithm.

At present, when people use a following robot to visually track a target, the robot is generally expected to have better tracking capability not only for a near target, but also for a far target, namely better tracking capability for both the far target and the near target. However, although the convolutional neural network algorithm improves the real-time accuracy of tracking, the convolutional neural network based target detection algorithm usually includes a large number of computation-intensive operations, and thus the requirements for real-time detection power and bandwidth are high. In particular, in order to detect objects at different distances, the current common methods are: carrying out multi-scale scaling on an original image to generate a multi-scale pyramid image group, and then respectively detecting input images with different scales; specifically, when a near object is detected, the detection is performed on the reduced image; when detecting a distant object, the detection is performed on a large-size image with high resolution. Because of the need to design a training neural network for each image scale, higher demands are placed on the computational power and bandwidth of the device. How to reduce the requirement of target tracking on the computing power of equipment on the basis of ensuring the real-time accuracy of far and near target tracking is a technical problem to be solved urgently at present.

Disclosure of Invention

The invention aims to: the defects of the prior art are overcome, and the robot following method and the following robot based on the vision are provided. According to the invention, the detection input image is formed by splicing a plurality of images which have different resolutions and are related to the visual field input image so as to match the size requirement of the pedestrian detection neural network model, the large-resolution image is suitable for detecting a near target, the small-resolution image is suitable for detecting a far target, and the input images with different scales are not required to be detected respectively, so that the requirement of target tracking on the calculation force of equipment is reduced on the basis of ensuring the real-time accuracy of the far and near target tracking, and the power consumption is reduced.

In order to achieve the above object, the present invention provides the following technical solutions:

a vision-based robot following method comprising the steps of:

receiving target object information;

the method comprises the steps that a vision camera acquires a visual field input image, the visual field input image is preprocessed to generate a detection input image, and the detection input image is input into a pedestrian detection neural network model for detection; the detection input image is formed by splicing a plurality of images which have different resolutions and are related to the view field input image;

acquiring a pedestrian detection result, and determining a following target in the pedestrian according to the target object information;

and carrying out robot following on the following target.

On the other hand, the step of determining the following target in the pedestrian according to the target object information is as follows:

acquiring all pedestrian information of a pedestrian detection result;

selecting the pedestrians matched with the target object information from all the pedestrians, and mapping the pedestrian selection result to the view input image for output and display;

when only one pedestrian is selected, taking the pedestrian as a following target; when a plurality of selected pedestrians are selected, the selected pedestrians in the visual field input image are identified through the candidate frame, selection information of the user on the candidate frame is collected, and the pedestrians in the candidate frame selected by the user are used as the following targets.

On the other hand, the direction of collecting the selection information of the candidate box by the user is one of the following modes:

acquiring selection information of a user on a candidate frame through a display screen and an operation button on the robot, outputting a focus area on the display screen, and adjusting the position of the focus area through the operation button to select the candidate frame;

or outputting a candidate frame through a touch display screen on the robot, and acquiring a selection instruction of a user on the candidate frame through the touch display screen;

or sending the view input image containing the candidate frame to a remote terminal where the associated user is located, and acquiring a selection instruction of the associated user on the remote terminal for the candidate frame.

On the other hand, the target object information comprises face feature information and first following distance information of a target object, the face features are used as recognition features to construct a visual tracker, and the first following distance is kept between the visual tracker and the following target in the target following process.

On the other hand, in the following process, an image of the following target is acquired, the clothing characteristic information, the dressing characteristic information, the carried article characteristic information and/or the gait characteristic information of the following target are identified as target additional information, the target additional information is sent to the visual tracker to update the following target information, and the tracking direction and the tracking distance are adjusted.

On the other hand, in the following process, the following target is kept to be positioned in the central area of the visual field;

when the following target deviates, the deviation amount is compensated by controlling the robot to rotate, or the deviation amount is compensated by controlling the vision camera installed on the robot to rotate.

On the other hand, the resolutions of a plurality of spliced images forming the detection input image are different;

or, in a plurality of spliced images forming the detection input image, the resolutions of partial spliced images are the same.

In another aspect, the preprocessing the view input image to generate the detection input image includes:

taking a view input image as an original resolution image, and compressing the original resolution image according to two compression ratios to obtain two global mapping images with different resolutions; the size of the global mapping image with small resolution is smaller than the required size of the detection input image, and the size of the global mapping image with large resolution is larger than the required size of the detection input image;

selecting a global map with low resolution as a first splicing map of the detection input image, and subtracting the size of the first splicing map by the size of the detection input image to obtain the size of the residual region;

and setting one or more intercepting frames according to the size of the residual area, acquiring a high-resolution edge local image in the edge area of the global map with high resolution through the intercepting frames, filling the edge local image into the residual area, and splicing to form the detection input image.

taking a view input image as an original resolution image, and when the size of the original resolution image is judged to be larger than the required size of the detection input image, compressing the original resolution image according to a compression ratio to obtain a small-resolution global mapping image, wherein the size of the small-resolution global mapping image is smaller than the required size of the detection input image;

selecting a global mapping chart with small resolution as a first splicing chart of a detection input image, and subtracting the size of the first splicing chart from the size of the detection input image to obtain the size of a residual region;

one or more intercepting frames are set according to the size of the residual area, the edge local image with high resolution is obtained in the edge area of the original resolution image through the intercepting frames, and the edge local image is filled in the residual area for splicing to form the detection input image.

The invention also provides a visual following robot, which comprises the following structure:

a vision camera for taking an image as a visual field input image;

a processor comprising a pedestrian detection module and a target following module;

the pedestrian detection module is used for preprocessing the view input image to generate a detection input image, and inputting the detection input image into the pedestrian detection neural network model for detection; the detection input image is formed by splicing a plurality of images which have different resolutions and are related to the view field input image;

and the target following module is used for acquiring a pedestrian detection result, determining a following target in the pedestrian according to the target object information and performing robot following on the following target.

Due to the adoption of the technical scheme, compared with the prior art, the invention has the following advantages and positive effects: the detection input images are formed by splicing a plurality of images which have different resolutions and are related to the visual field input images so as to match the size requirement of a pedestrian detection neural network model, the large-resolution images are suitable for detecting near targets, the small-resolution images are suitable for detecting far targets, and the input images with different scales are not required to be detected respectively.

Drawings

Fig. 1 is a flowchart of a vision-based robot following method according to an embodiment of the present invention.

FIG. 2 is a diagram illustrating several shapes of a capture box according to an embodiment of the present invention, wherein 2a illustrates a rectangular box, 2b illustrates an L-shaped box, and 2c illustrates an L-shaped box

Type 2d example □ type.

Fig. 3 is an information transmission diagram of a pedestrian detection process according to an embodiment of the present invention.

Detailed Description

The following describes the following robot and the following robot based on vision disclosed in the present invention with reference to the accompanying drawings and specific embodiments. It should be noted that technical features or combinations of technical features described in the following embodiments should not be considered as being isolated, and they may be combined with each other to achieve better technical effects. In the drawings of the embodiments described below, the same reference numerals appearing in the respective drawings denote the same features or components, and may be applied to different embodiments. Thus, once an item is defined in one drawing, it need not be further discussed in subsequent drawings. The drawings are only for purposes of illustration and description and are not intended to limit the scope of the invention, which is to be determined by the claims, the appended drawings, and all changes that fall within the metes and bounds of the claims, or equivalences of such metes and bounds are therefore intended to be embraced by the claims.

Examples

Referring to fig. 1, a vision-based robot following method provided by the present invention includes the steps of:

step 1, receiving target object information.

Step 2, the vision camera acquires a visual field input image, the visual field input image is preprocessed to generate a detection input image, and the detection input image is input into a pedestrian detection neural network model for detection; the detection input image is formed by splicing a plurality of images which have different resolutions and are related to the view field input image.

And 3, acquiring a pedestrian detection result, and determining a following target in the pedestrian according to the target object information.

And 4, performing robot following on the following target.

In this embodiment, preferably, the target object information includes face feature information and first following distance information of the target object, the face feature is used as an identification feature to construct a visual tracker, and the first following distance is kept with the following target in the target following process. At this time, when the robot tracks the target, the robot may first perform the confirmation of the following target based on the side feature of the human face and/or the facial features of the five sense organs, and after confirming the following target, acquire other features of the target, such as gait features, clothing features, and the like, so as to facilitate the following behind.

Specifically, in the following process, after the target is confirmed through the face features, an image of the following target can be obtained, clothing feature information, dressing feature information, carried article feature information and/or gait feature information of the following target are identified as target additional information, the target additional information is sent to the visual tracker to update the following target information, and the tracking direction and the tracking distance are adjusted. Preferably, the tracking distance is adjusted to a second tracking distance, which is greater than the first tracking distance.

During the following, it is preferable to keep the following target located in the central region of the field of view. When the following target deviates, the deviation amount is compensated by controlling the robot to rotate, or the deviation amount is compensated by controlling the vision camera installed on the robot to rotate.

By adopting the technical scheme, the robot can determine the following target based on the human face features which have the significant features (facial features) and are easy to confirm the target identity, and then carry out rear following through other features of the following target, so that the following behavior can be hidden conveniently.

In this embodiment, preferably, the step of determining the following target in the pedestrian according to the target object information in step 3 specifically includes:

and step 31, acquiring all pedestrian information of the pedestrian detection result.

And 32, selecting the pedestrians matched with the target object information from all the pedestrians, and mapping the pedestrian selection result to the view input image for output and display.

When only one pedestrian is selected, the pedestrian is taken as a following target in step 331.

And step 332, when a plurality of selected pedestrians are selected, identifying the selected pedestrians in the visual field input image through the candidate frame, collecting selection information of the user on the candidate frame, and taking the pedestrians in the candidate frame selected by the user as a following target.

In this embodiment, preferably, the orientation for collecting the selection information of the candidate box by the user is one of the following manners.

The first method is as follows: the selection information of the user on the candidate frame is collected through a display screen and an operation button on the robot, a focus area is output on the display screen, and the position of the focus area is adjusted through the operation button to select the candidate frame.

For example, without limitation, 2 candidate frames are output on a display screen on the robot, and 5 operation buttons are provided on one side of the display screen, namely an up button, a down button, a left button, a right button, and a determination button located in the middle. The focus area is located by default on the candidate box closest to the center of the field of view, and the user can adjust the position of the focus area by pressing the up, down, left and right buttons. When the focus area is located on the candidate frame where the tracking target is located, the user may click the candidate frame through the aforementioned determination button as a selection operation. Or, the staying time of the focus area on the candidate frame is collected, and if the focus area is not moved by the user within the preset time range, the corresponding candidate frame can be determined as the candidate frame selected by the user.

Or outputting the candidate frame through a touch display screen on the robot, and acquiring a selection instruction of the user on the candidate frame through the touch display screen.

The remote terminal preferably adopts a mobile phone, a tablet computer and a wearable intelligent terminal, such as intelligent glasses and an intelligent watch. Therefore, the remote user can assist the robot in target tracking.

Preferably, the robot may start a video recording function during the following process, store video data of the video recording in an associated memory or a cloud server, and periodically send the video data to the terminal of the user. Further, the user can also send a real-time reference instruction to the robot through the terminal, and the robot sends the current real-time video data or screenshot to the terminal of the user according to the real-time reference instruction.

In this embodiment, the resolutions of the plurality of stitched images composing the detection input image may all be different. By way of example and not limitation, for example, the input image is detected to include 3 stitched images, and the resolutions of the 3 stitched images are different.

Or, in a plurality of spliced images forming the detection input image, the resolutions of partial spliced images are the same. By way of example and not limitation, detecting that the input image includes 3 stitched images, wherein the resolution of 2 stitched images is taken from images of the same resolution, and the resolutions of the two are the same. The resolution of the other 1 stitched image is different from the two.

In a preferred embodiment, the step of preprocessing the view input image to generate the detection input image may be as follows:

taking a view input image as an original resolution image, and compressing the original resolution image according to two compression ratios to obtain two global mapping images with different resolutions; the size of the global map image with small resolution is smaller than the required size of the detection input image, and the global map image with large resolution is larger than the required size of the detection input image.

And selecting the global map with low resolution as a first splicing map of the detection input image, and subtracting the size of the first splicing map by the size of the detection input image to obtain the size of the residual area.

The intercepting frame is fixed, for each frame of image, the edge local image with high resolution is only obtained in the fixed edge area of the global map with high resolution, the shape and the size of the intercepting frame are matched with the size of the residual area, and the size of the intercepting frame is larger than the minimum detection size of the pedestrian detection neural network model. Specifically, the shape of the intercepting frame can be set as a rectangular frame, an L-shaped frame, a rectangular,

Profiles (openings can be up, down, left or right) and □ profiles, see fig. 2.

Preferably, the intercepting frames are arranged to be rectangular frames, the rectangular frames can be arranged to be multiple according to the shape of the residual region, and the plurality of rectangular intercepting frames can form the shape of the residual region through edge splicing.

The fixed edge region may be a left edge region, a right edge region, an upper edge region and/or a lower edge region, preferably a right edge region and/or an upper edge region. Since the probability that a small object at a far position is located in the edge region of an image is greater than the probability that the small object at a far position is located in the middle region of the image (the center region of the field of view or the region close to the center region of the field of view) extending outward from the center of the field of view when the camera takes an image, that is, the probability that a small object at a far position is detected in the edge region of the image is greater than the probability that a small object at a far position is detected in the middle region of the image (a large object at a near position is more easily detected in the middle region of the image), when a large object at a far position is detected by a global map having a small resolution, the detection rate of.

The detection input image is a fixed input size, and the size of the detection input image input to the pedestrian detection neural network model needs to be consistent with the fixed input size. According to the size of the remaining region of the detection input image, one or more capture boxes can be set to obtain a local image in the edge region of the global map with high resolution (the image in the capture box is the captured local image). By adopting the detection input image with fixed size, the model training and the model design complexity of the pedestrian detection neural network model can be obviously simplified.

By way of example and not limitation, referring to fig. 3, for example, the width and height of the input image of the field of view is 1000 × 1000 pixels, that is, the resolution of the original resolution image is 1000 × 1000 pixels, the input size required for detecting the input image is 540 × 360 pixels, and the original resolution image is compressed at two compression ratios (compression at a high compression ratio) to obtain two global mapping images with different resolutions, that is, 300 × 300 pixels (compression ratio of 0.3) and 600 × 600 pixels (compression ratio of 0.6), where the former size is smaller than the required size for detecting the input image and the latter size is larger than the required size for detecting the input image.

And taking the global map with the resolution of 300-300 pixels as a first splicing map, and then carrying out splicing filling on the residual area by using a local image with large resolution at the edge of the first splicing map according to the size 540-360 pixels of the detected input image. The splice filling rule can be set by system default or user personalization, for example, the splice filling rule can be set as: the local image is stitched and filled based on the right edge of the first stitched image preferentially over the left edge, and the local image is stitched and filled based on the lower edge of the first stitched image preferentially over the upper edge. For example, the width and height of the 2 rectangular capture frames are 240 × 360 pixels and 300 × 60 pixels respectively, the captured image in the 240 × 360 capture frame is spliced to the left edge of the first mosaic, the width requirement of 540 for detecting the input image is met (300+240 × 540), the captured image in the 300 × 60 capture frame is spliced to the lower edge of the first mosaic for filling, the height requirement of 360 for detecting the input image is met (300+60 × 360), and a spliced image meeting the size requirement of detecting the input image is constructed.

It should be noted that, according to the shape of the size of the surplus region and the detection requirement, more rectangular capture frames may be provided as long as the plurality of rectangular capture frames can form the shape of the surplus region by edge stitching. However, when the rectangular capture frame is set, the number of rectangular frames is preferably set based on the rule of "minimizing rectangular frames participating in the splicing".

In another preferred embodiment, the step of preprocessing the view input image to generate the detection input image is as follows:

and taking the view input image as an original resolution image, and when the size of the original resolution image is judged to be larger than the required size of the detection input image, compressing the original resolution image according to a compression ratio to obtain a small-resolution global mapping image, wherein the size of the small-resolution global mapping image is smaller than the required size of the detection input image.

And selecting the global map with small resolution as a first splicing map of the detection input image, and subtracting the size of the first splicing map by the size of the detection input image to obtain the size of the residual area.

In another implementation manner of this embodiment, the fixed intercepting frame may be a sliding frame that can move regularly. Specifically, the sliding window may move to different positions on the designated image according to a preset movement rule, for example, the full map may be scanned at a constant speed from the top left corner of the designated image in the order from left to right and from top to bottom, or the full map may be scanned according to the order set by the user, or the full map may be scanned according to a random movement rule. In this way, complete detection of large-resolution images can be achieved.

At this time, the step of preprocessing the view input image to generate the detection input image may be as follows:

And setting one or more sliding frames according to the size of the residual area, wherein the sliding frames can be moved to different positions on the global map with high resolution according to a preset moving rule and frames, acquiring a local image of the global map on the global map with high resolution through the intercepting frame, and filling the local image into the residual area for splicing to form the detection input image.

The invention also provides a visual following robot based on the detection of the far and near targets.

The vision following robot comprises the following structures:

and a vision camera for capturing an image as a visual field input image.

And the processor comprises a pedestrian detection module and a target following module.

The pedestrian detection module is used for preprocessing the view input image to generate a detection input image, and inputting the detection input image into the pedestrian detection neural network model for detection; the detection input image is formed by splicing a plurality of images which have different resolutions and are related to the view field input image.

The target following module is configured to: acquiring all pedestrian information of a pedestrian detection result; selecting the pedestrians matched with the target object information from all the pedestrians, and mapping the pedestrian selection result to the view input image for output and display; when only one pedestrian is selected, taking the pedestrian as a following target; when a plurality of selected pedestrians are selected, the selected pedestrians in the visual field input image are identified through the candidate frame, selection information of the user on the candidate frame is collected, and the pedestrians in the candidate frame selected by the user are used as the following targets.

The processor also comprises an initialization setting module which is used for collecting target object information set by a user. Preferably, the target object information includes face feature information and first following distance information of the target object.

At this time, the target following module is configured to: constructing a visual tracker by taking the human face features as recognition features, and keeping the first following distance with a following target in the process of following the target; and acquiring an image of the following target in the following process, identifying clothing characteristic information, dressing characteristic information, carried article characteristic information and/or gait characteristic information of the following target as target additional information, sending the target additional information to a visual tracker for updating the following target information, and adjusting the tracking direction and the tracking distance.

The target following module also keeps the following target positioned in the central area of the visual field in the following process; when the following target deviates, the deviation amount is compensated by controlling the robot to rotate, or the deviation amount is compensated by controlling the vision camera installed on the robot to rotate.

Other technical features are described in the foregoing embodiments, and the processor or the module thereof may be configured to perform the information transmission and information processing functions described in the foregoing embodiments, which are not described in detail herein.

In the description above, the various components may be selectively and operatively combined in any number within the intended scope of the present disclosure. In addition, terms like "comprising," "including," and "having" should be interpreted as inclusive or open-ended, rather than exclusive or closed-ended, by default, unless explicitly defined to the contrary. While exemplary aspects of the present disclosure have been described for illustrative purposes, those skilled in the art will appreciate that the foregoing description is by way of description of the preferred embodiments of the present disclosure only, and is not intended to limit the scope of the present disclosure in any way, which includes additional implementations in which functions may be performed out of the order of presentation or discussion. Any changes and modifications of the present invention based on the above disclosure will be within the scope of the appended claims.

Claims

1. A vision-based robot following method, characterized by comprising the steps of:

receiving target object information;

and carrying out robot following on the following target.

2. The robot following method according to claim 1, wherein: the step of determining the following target in the pedestrian based on the aforementioned target object information is,

acquiring all pedestrian information of a pedestrian detection result;

3. The robot following method according to claim 2, wherein: the orientation of the selection information of the candidate box acquired by the user is one of the following ways:

4. The robot following method according to claim 1, wherein: the target object information comprises face feature information and first following distance information of a target object, the face features are used as recognition features to construct a visual tracker, and the first following distance is kept between the visual tracker and the following target in the target following process.

5. The robot following method according to claim 4, wherein: in the following process, an image of the following target is obtained, clothing characteristic information, dressing characteristic information, carried article characteristic information and/or gait characteristic information of the following target are identified as target additional information, the target additional information is sent to a visual tracker to update the following target information, and the tracking direction and the tracking distance are adjusted.

6. The robot following method according to claim 5, wherein: in the following process, keeping the following target in the central area of the visual field;

7. The robot following method according to claim 1, wherein: the resolution ratios of a plurality of spliced images forming the detection input image are different;

8. The robot following method according to claim 1, wherein: the preprocessing the view input image to generate the detection input image includes:

9. The robot following method according to claim 1, wherein: the preprocessing the view input image to generate the detection input image includes:

10. A vision following robot, characterized by comprising:

a vision camera for taking an image as a visual field input image;