CA3150597A1 - Pedestrian detecting method and device - Google Patents

Pedestrian detecting method and device

Info

Publication number
CA3150597A1
Authority
CA
Canada
Prior art keywords
depth
pixels
view
background
full
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CA3150597A
Other languages
French (fr)
Inventor
Yantao YIN
Jiang Liu
Yinjun HUANG
Huaiyuan JI
Wei Jing
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
10353744 Canada Ltd
Original Assignee
10353744 Canada Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 10353744 Canada Ltd
Publication of CA3150597A1


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 - Movements or behaviour, e.g. gesture recognition
    • G06V40/23 - Recognition of whole body movements, e.g. for sport training
    • G06V40/25 - Recognition of walking or running movements, e.g. gait recognition
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/00 - Image enhancement or restoration
    • G06T5/50 - Image enhancement or restoration using two or more images, e.g. averaging or subtraction
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 - Image analysis
    • G06T7/20 - Analysis of motion
    • G06T7/246 - Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 - Image analysis
    • G06T7/70 - Determining position or orientation of objects or cameras
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 - Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 - Indexing scheme for image analysis or image enhancement
    • G06T2207/20 - Special algorithmic details
    • G06T2207/20212 - Image combination
    • G06T2207/20221 - Image fusion; Image merging
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 - Indexing scheme for image analysis or image enhancement
    • G06T2207/30 - Subject of image; Context of image processing
    • G06T2207/30196 - Human being; Person
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 - Indexing scheme for image analysis or image enhancement
    • G06T2207/30 - Subject of image; Context of image processing
    • G06T2207/30241 - Trajectory

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • Health & Medical Sciences (AREA)
  • Social Psychology (AREA)
  • Psychiatry (AREA)
  • General Health & Medical Sciences (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The present invention discloses a pedestrian detecting method and a corresponding device, and relates to the field of image recognition technology. The method comprises: creating a background mask corresponding to each depth camera according to a first depth image captured by each depth camera, wherein the background mask includes a ground mask and a marker mask; respectively updating the background masks to which the various depth cameras correspond on the basis of pixels in plural frames of second depth images continuously captured by each depth camera and pixels in the background mask corresponding to each depth camera; and recognizing a pedestrian detecting result by comparing pixels in the full-scene top-view depth picture with pixels in the full-scene top-view depth background picture, and comparing pixels in the full-scene top-view color picture with pixels in the full-scene top-view color background picture.

Description

PEDESTRIAN DETECTING METHOD AND DEVICE
BACKGROUND OF THE INVENTION
Technical Field
[0001] The present invention relates to the field of image recognition technology, and more particularly to a pedestrian detecting method and a pedestrian detecting device.
Description of Related Art
[0002] In this age of vigorous development of artificial intelligence, new things spring up like mushrooms after a spring rain, with unmanned supermarkets and unmanned stores emerging one after another. With smart retail becoming the trend of the day, offline retail is being combined with artificial intelligence, and providing a completely novel purchasing mode as smooth as online shopping has become a new direction of research. The imperceptible "take-and-go" shopping experience only becomes real when such services as commodity recommendation and settlement are provided in real time through full-coverage shooting of the behavior track of every customer entering a closed scenario.
[0003] The few existing pedestrian detecting schemes are all directed to relatively open spatial scenarios in which shooting is inevitably oriented obliquely downward. The advantage thereof is the larger projection area of shooting, which facilitates obtaining more feature information, but the ensuing shielding problem cannot be avoided. In such a complicated scenario as an unattended store or an unmanned supermarket, the performance degradation brought about by shielding might render it impossible for the entire system to operate normally, so that settlement on leaving the store and the shopping experience are adversely affected.

SUMMARY OF THE INVENTION
[0004] An objective of the present invention is to provide a pedestrian detecting method and a pedestrian detecting device, whereby the problem of shielded information being missed due to oblique shooting by a single camera is effectively solved by collecting and monitoring pedestrian data within a scenario at specific angles by means of plural depth cameras, and the precision of pedestrian detection data is enhanced.
[0005] In order to achieve the above objective, a first aspect of the present invention provides a pedestrian detecting method that comprises:
[0006] creating a background mask corresponding to each depth camera according to a first depth image captured by each depth camera, wherein the background mask includes a ground mask and a marker mask;
[0007] respectively updating background masks to which various depth cameras correspond on the basis of pixels in plural frames of second depth images continuously captured by each depth camera, and pixels in the background mask corresponding to each depth camera;
[0008] obtaining a full-scene top-view depth background picture and a full-scene top-view color background picture after coordinate-transforming and merging pixels in the background masks to which the various depth cameras correspond;
[0009] splitting the full-scene top-view depth background picture into separate top-view depth background pictures corresponding to each depth camera, and splitting the full-scene top-view color background picture into separate top-view color background pictures corresponding to each depth camera;
[0010] updating pixels in a foreground region into the top-view depth background picture and the top-view color background picture of the corresponding depth camera by recognizing the foreground region that contains human body pixels in third depth images obtained in real time by the various depth cameras, so as to update the top-view depth picture and the top-view color picture of each depth camera;

[0011] merging the top-view depth pictures of the various depth cameras to form a full-scene top-view depth picture, and merging the top-view color pictures of the various depth cameras to form a full-scene top-view color picture; and
[0012] recognizing a pedestrian detecting result by comparing pixels in the full-scene top-view depth picture and pixels in the full-scene top-view depth background picture, and comparing pixels in the full-scene top-view color picture and pixels in the full-scene top-view color background picture.
[0013] Preferably, the step of creating a background mask corresponding to each depth camera according to a first depth image captured by each depth camera includes:
[0014] frame-selecting a ground region from the first depth image captured by each depth camera to create a ground fitting formula, and frame-selecting at least one marker region to create a marker fitting formula corresponding to the marker region in a one-to-one manner;
[0015] creating the ground mask corresponding to each depth camera according to the ground fitting formula, and creating the marker mask corresponding to each depth camera according to the marker fitting formula; and
[0016] merging the ground mask and the marker mask to form the background mask corresponding to each depth camera.
[0017] Preferably, the step of updating background masks on the basis of pixels in plural frames of second depth images continuously captured by each depth camera, and pixels in the background mask corresponding to each depth camera includes:
[0018] comparing depth values of pixels at various corresponding locations in the mth frame of second depth image and the (m+1)th frame of second depth image captured by the same depth camera, where an initial value of m is 1;
[0019] recognizing pixels whose depth values are changed, updating the depth value of a pixel at a corresponding location in the (m+1)th frame of second depth image as the smaller value in the comparing result, letting m = m+1, and comparing again depth values of pixels at various corresponding locations in the mth frame of second depth image and the (m+1)th frame of second depth image, until pixels at various locations in the last frame of second depth image and their corresponding depth values are obtained;
[0020] comparing the pixels at various locations in the last frame of second depth image and their corresponding depth values with pixels at various locations in the background mask and their corresponding depth values; and
[0021] recognizing pixels whose depth values are changed, and updating the depth value of a pixel at a corresponding location in the background mask as the smaller value in the comparing result.
[0022] Preferably, the step of obtaining a full-scene top-view depth background picture and a full-scene top-view color background picture after coordinate-transforming and merging pixels in the background masks to which the various depth cameras correspond includes:
[0023] creating a full-scene top-view depth background blank template picture and a full-scene top-view color background blank template picture, wherein depth values of pixels at various locations in the full-scene top-view depth background blank template picture are zero, and color values of pixels at various locations in the full-scene top-view color background blank template picture are zero;
[0024] merging and unifying pixels in the background masks to which the various depth cameras correspond to form a full-scene background mask, uniformly transforming the pixel coordinates to world coordinates, and then uniformly transforming the world coordinates to top-view coordinates;
[0025] sequentially traversing pixels in the full-scene background mask, comparing a depth value of each pixel with depth values of pixels at corresponding locations in the full-scene top-view depth background blank template picture, and replacing pixels at corresponding locations in the full-scene top-view depth background blank template picture with large-value pixels in the full-scene background mask, to obtain a full-scene top-view depth background picture; and
[0026] on the basis of pixels to which replacement occurs in the full-scene top-view depth background picture, replacing pixels at corresponding locations in the full-scene top-view color background blank template picture with their pixel color values, to obtain a full-scene top-view color background picture.
[0027] Preferably, the step of splitting the full-scene top-view depth background picture into separate top-view depth background pictures corresponding to each depth camera, and splitting the full-scene top-view color background picture into separate top-view color background pictures corresponding to each depth camera includes:
[0028] on the basis of top-view coordinates of pixels of the background mask to which each depth camera corresponds, splitting the full-scene top-view depth background picture into separate top-view depth background pictures corresponding to each depth camera, and splitting the full-scene top-view color background picture into separate top-view color background pictures corresponding to each depth camera.
[0029] Further, the step of updating pixels in a foreground region into the top-view depth background picture and the top-view color background picture of the corresponding depth camera by recognizing the foreground region that contains human body pixels in third depth images obtained in real time by the various depth cameras includes:
[0030] comparing depth values of pixels in the third depth images obtained in real time by the depth cameras with depth values of pixels of the corresponding separate top-view depth background pictures;
[0031] employing a frame difference method to recognize pixels whose depth values are small values in the third depth images, and summarizing to obtain a foreground region that contains human body pixels;
[0032] correspondingly matching and associating pixels in the foreground region with pixels of the separate top-view depth background pictures in a one-to-one manner, and replacing depth values of the pixels in the separate top-view depth background pictures with depth values of the pixels in the corresponding foreground region; and
[0033] recognizing pixels to which replacement occurs in the separate top-view depth background pictures, and replacing corresponding pixels in the separate top-view color background pictures with color values of pixels in the foreground region.
[0034] Further, the step of merging the top-view depth pictures of the various depth cameras to form a full-scene top-view depth picture, and merging the top-view color pictures of the various depth cameras to form a full-scene top-view color picture includes:
[0035] traversing pixels in the corresponding top-view depth picture of each depth camera, and replacing depth values of pixels at corresponding locations in the full-scene top-view depth background picture, to obtain a full-scene top-view depth picture; and
[0036] recognizing pixels to which replacement occurs in the full-scene top-view depth picture, and replacing color values of pixels at corresponding locations in the full-scene top-view color background picture, to obtain a full-scene top-view color picture.
[0037] Preferably, the step of recognizing a pedestrian detecting result by comparing pixels in the full-scene top-view depth picture and pixels in the full-scene top-view depth background picture, and comparing pixels in the full-scene top-view color picture and pixels in the full-scene top-view color background picture includes:
[0038] comparing pixels whose depth values are changed in the full-scene top-view depth picture and the full-scene top-view depth background picture, and on the basis of the area of a dense region of pixels and the depth values of the various pixels, recognizing a head volume and/or a body volume; and
[0039] recognizing a pedestrian detecting result on the basis of the size of the head volume and/or the size of the body volume.
[0040] As compared with prior-art technology, the pedestrian detecting method provided by the present invention achieves the following advantageous effects.
[0041] The pedestrian detecting method provided by the present invention can be divided into an algorithm preparation phase, an algorithm initialization phase and an algorithm detection application phase in the actual application, of which the algorithm preparation phase is also the phase of generating the background mask of each depth camera, and the specific process is as follows: a first depth image of the current detected scenario is firstly obtained by each depth camera that shoots the image overhead, a ground region and at least one marker region are frame-selected in the first depth image, a ground fitting formula corresponding to each depth camera and a corresponding marker fitting formula are created, and a ground mask established from the ground fitting formula and marker masks established from the various marker fitting formulae are then merged to obtain background masks corresponding to the various depth cameras in the current scenario.
The algorithm initialization phase is also a background mask updating phase, and the specific process is as follows: background update is performed on the background masks to which various depth cameras correspond on the basis of depth values of pixels in plural continuous frames of second depth images as obtained and depth values of pixels in the corresponding background mask, a full-scene top-view depth background picture and a full-scene top-view color background picture are subsequently obtained after coordinate-transforming and merging pixels in the various background masks, the full-scene top-view depth background picture is thereafter split into separate top-view depth background pictures corresponding to each depth camera, the full-scene top-view color background picture is split into separate top-view color background pictures corresponding to each depth camera, consequently, pixels in a foreground region are updated into the top-view depth background picture and the top-view color background picture of the corresponding depth camera on the basis of the foreground region that contains human body pixels in a third depth image obtained in real time by each depth camera, so as to update the top-view depth picture and the top-view color picture of each depth camera, finally, the top-view depth pictures of the various depth cameras are merged to form a full-scene top-view depth picture, and the top-view color pictures of the various depth cameras are merged to form a full-scene top-view color picture. The algorithm detection application phase is a human body region detecting phase, and its corresponding specific process is as follows: a pedestrian detecting result is comprehensively recognized by comparing pixels in the full-scene top-view depth picture and pixels in the full-scene top-view depth background picture, and comparing pixels in the full-scene top-view color picture and pixels in the full-scene top-view color background picture.
[0042] As can be seen, the present invention utilizes specific viewing angles, such as the overhead shooting mode, to obtain depth images and establish background masks, solves the problem of information missing due to shielding under oblique shooting, and enlarges the application scenarios of pedestrian detection; in addition, the use of depth cameras increases the information dimensions of images as compared with the use of ordinary cameras, whereby data of 3D spatial coordinates including the human height and the head can be obtained, and the precision of pedestrian detection data is enhanced. Through the distributed layout of multiple depth cameras, it is made possible to adapt to complicated monitored scenarios where a great deal of shielding is present, and the use of both depth images and color images as judging conditions makes it possible to further enhance the precision of pedestrian detection data.
[0043] The second aspect of the present invention provides a pedestrian detecting device that is applied to the pedestrian detecting method as recited in the aforementioned technical solution, and the device comprises:
[0044] a mask creating unit, for creating a background mask corresponding to each depth camera according to a first depth image captured by each depth camera, wherein the background mask includes a ground mask and a marker mask;
[0045] a mask updating unit, for respectively updating background masks to which various depth cameras correspond on the basis of pixels in plural frames of second depth images continuously captured by each depth camera, and pixels in the background mask corresponding to each depth camera;
[0046] a mask merging unit, for obtaining a full-scene top-view depth background picture and a full-scene top-view color background picture after coordinate-transforming and merging pixels in the background masks to which the various depth cameras correspond;
[0047] a background splitting unit, for splitting the full-scene top-view depth background picture into separate top-view depth background pictures corresponding to each depth camera, and splitting the full-scene top-view color background picture into separate top-view color background pictures corresponding to each depth camera;
[0048] a foreground recognizing unit, for updating pixels in a foreground region into the top-view depth background picture and the top-view color background picture of the corresponding depth camera by recognizing the foreground region that contains human body pixels in third depth images obtained in real time by the various depth cameras, so as to update the top-view depth picture and the top-view color picture of each depth camera;
[0049] a full-scene merging unit, for merging the top-view depth pictures of the various depth cameras to form a full-scene top-view depth picture, and merging the top-view color pictures of the various depth cameras to form a full-scene top-view color picture; and
[0050] a pedestrian detecting unit, for recognizing a pedestrian detecting result by comparing pixels in the full-scene top-view depth picture and pixels in the full-scene top-view depth background picture, and comparing pixels in the full-scene top-view color picture and pixels in the full-scene top-view color background picture.
[0051] In comparison with prior-art technology, the advantageous effects achievable by the pedestrian detecting device provided by the present invention are identical with the advantageous effects achieved by the pedestrian detecting method as recited in the aforementioned technical solution, so no repetition is redundantly made thereto in this context.
[0052] The third aspect of the present invention provides a computer-readable storage medium storing thereon a computer program that executes the steps of the aforementioned pedestrian detecting method when it is operated by a processor.
[0053] In comparison with prior-art technology, the advantageous effects achievable by the computer-readable storage medium provided by the present invention are identical with the advantageous effects achieved by the pedestrian detecting method as recited in the aforementioned technical solution, so no repetition is redundantly made thereto in this context.
BRIEF DESCRIPTION OF THE DRAWINGS
[0054] The drawings described here are merely meant to provide further understanding of the present invention, and constitute a portion of the present invention.
Exemplary embodiments of the present invention and descriptions thereof are meant to explain the present invention, rather than to restrict the present invention. In the drawings:
[0055] Fig. 1 is a flowchart schematically illustrating the pedestrian detecting method in Embodiment 1 of the present invention.
DETAILED DESCRIPTION OF THE INVENTION
[0056] In order to make apparent and comprehensible the aforementioned objectives, features and advantages of the present invention, the technical solutions in the embodiments of the present invention will be more clearly and comprehensively described below with reference to the accompanying drawings in the embodiments of the present invention.
Apparently, the embodiments as described are merely partial, rather than the entire, embodiments of the present invention. All other embodiments obtainable by persons ordinarily skilled in the art on the basis of the embodiments in the present invention without spending creative effort in the process shall all be covered by the protection scope of the present invention.
[0057] Embodiment 1
[0058] Please refer to Fig. 1, this embodiment provides a pedestrian detecting method that comprises:
[0059] creating a background mask corresponding to each depth camera according to a first depth image captured by each depth camera, wherein the background mask includes a ground mask and a marker mask; respectively updating background masks to which various depth cameras correspond on the basis of pixels in plural frames of second depth images continuously captured by each depth camera, and pixels in the background mask corresponding to each depth camera; obtaining a full-scene top-view depth background picture and a full-scene top-view color background picture after coordinate-transforming and merging pixels in the background masks to which the various depth cameras correspond; splitting the full-scene top-view depth background picture into separate top-view depth background pictures corresponding to each depth camera, and splitting the full-scene top-view color background picture into separate top-view color background pictures corresponding to each depth camera; updating pixels in a foreground region into the top-view depth background picture and the top-view color background picture of the corresponding depth camera by recognizing the foreground region that contains human body pixels in third depth images obtained in real time by the various depth cameras, so as to update the top-view depth picture and the top-view color picture of each depth camera; merging the top-view depth pictures of the various depth cameras to form a full-scene top-view depth picture, and merging the top-view color pictures of the various depth cameras to form a full-scene top-view color picture; and recognizing a pedestrian detecting result by comparing pixels in the full-scene top-view depth picture and pixels in the full-scene top-view depth background picture, and comparing pixels in the full-scene top-view color picture and pixels in the full-scene top-view color background picture.
[0060] The pedestrian detecting method provided by this embodiment can be divided into an algorithm preparation phase, an algorithm initialization phase and an algorithm detection application phase in the actual application, of which the algorithm preparation phase is also the phase of generating the background mask of each depth camera, and the specific process is as follows: a first depth image of the current detected scenario is firstly obtained by each depth camera that shoots the image overhead, a ground region and at least one marker region are frame-selected in the first depth image, a ground fitting formula corresponding to each depth camera and a corresponding marker fitting formula are created, and a ground mask established from the ground fitting formula and marker masks established from the various marker fitting formulae are then merged to obtain background masks corresponding to the various depth cameras in the current scenario.
The algorithm initialization phase is also a background mask updating phase, and the specific process is as follows: background update is performed on the background masks to which various depth cameras correspond on the basis of depth values of pixels in plural continuous frames of second depth images as obtained and depth values of pixels in the corresponding background mask, a full-scene top-view depth background picture and a full-scene top-view color background picture are subsequently obtained after coordinate-transforming and merging pixels in the various background masks, the full-scene top-view depth background picture is thereafter split into separate top-view depth background pictures corresponding to each depth camera, the full-scene top-view color background picture is split into separate top-view color background pictures corresponding to each depth camera, consequently, pixels in a foreground region are updated into the top-view depth background picture and the top-view color background picture of the corresponding depth camera on the basis of the foreground region that contains human body pixels in a third depth image obtained in real time by each depth camera, so as to update the top-view depth picture and the top-view color picture of each depth camera, finally, the top-view depth pictures of the various depth cameras are merged to form a full-scene top-view depth picture, and the top-view color pictures of the various depth cameras are merged to form a full-scene top-view color picture. The algorithm detection application phase is a human body region detecting phase, and its corresponding specific process is as follows: a pedestrian detecting result is comprehensively recognized by comparing pixels in the full-scene top-view depth picture and pixels in the full-scene top-view depth background picture, and comparing pixels in the full-scene top-view color picture and pixels in the full-scene top-view color background picture.
[0061] As can be seen, this embodiment utilizes specific viewing angles, such as the overhead shooting mode, to obtain depth images and establish background masks, solves the problem of information missing due to shielding under oblique shooting, and enlarges the application scenarios of pedestrian detection; in addition, the use of depth cameras increases the information dimensions of images as compared with the use of ordinary cameras, whereby data of 3D spatial coordinates including the human height and the head can be obtained, and the precision of pedestrian detection data is enhanced. Through the distributed layout of multiple depth cameras, it is made possible to adapt to complicated monitored scenarios where a great deal of shielding is present, and the use of both depth images and color images as judging conditions makes it possible to further enhance the precision of pedestrian detection data.
[0062] As should be noted, the first depth image, second depth image and third depth image in the above embodiment differ from one another only in terms of purposes of use, of which the first depth image is used to create the ground fitting formula and the marker fitting formula, the second depth image is used to update the background mask, and the third depth image is used to obtain a real-time detected image of human body detection data.
For instance, the first frame of image obtained through overhead shooting of a monitored region by a depth camera is taken to serve as the first depth image, and the second to the hundredth frames of depth images are taken to serve as second depth images; after the background mask has been updated to completion, the real-time image obtained through overhead shooting of the monitored region by the depth camera is taken to serve as the third depth image.
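As a concrete illustration of the frame-role split described in paragraph [0062], the snippet below partitions a stream of synthetic depth frames into the three roles. It is a minimal sketch only: the resolution, value range and total frame count are assumptions, and only the 1 / 99 / remainder split follows the example above.

```python
import numpy as np

# Synthetic stand-in for an overhead depth camera stream (depth in millimetres).
# Resolution and value range are illustrative assumptions.
frames = [np.random.randint(500, 3000, size=(480, 640)).astype(np.float32)
          for _ in range(150)]

first_depth_image = frames[0]        # frame 1: fit ground/marker formulas, build the background mask
second_depth_images = frames[1:100]  # frames 2-100: update the background mask
third_depth_images = frames[100:]    # later frames: real-time detection input
```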
[0063] In this embodiment, the step of creating a background mask corresponding to each depth camera according to a first depth image captured by each depth camera includes:
[0064] frame-selecting a ground region from the first depth image captured by each depth camera to create a ground fitting formula, and frame-selecting at least one marker region to create a marker fitting formula corresponding to the marker region in a one-to-one manner;
creating the ground mask corresponding to each depth camera according to the ground fitting formula, and creating the marker mask corresponding to each depth camera according to the marker fitting formula; and merging the ground mask and the marker mask to form the background mask corresponding to each depth camera.
[0065] During specific implementation, explanation is now made with an example of creating a background mask for a first depth image captured by one of the depth cameras.
The method of creating a ground fitting formula based on the ground region frame-selected from the first depth image includes:
[0066] S11 - making statistics on a data collection corresponding to the ground region, the data collection including a plurality of pixel coordinates and depth values corresponding thereto;
[0067] S12 - randomly selecting n pixels from the ground region to create a ground initial dataset, where n ≥ 3 and n is an integer;
[0068] S13 - creating an initial ground fitting formula based on the currently selected n pixels, traversing pixels not selected in the initial dataset, and sequentially substituting the pixels in the initial ground fitting formula to calculate ground fitting values of the corresponding pixels;
[0069] S14 - screening out ground fitting values that are smaller than a first threshold, and generating the ith round of effective ground fitting value collection, where the initial value of i is 1;
[0070] S15 - when a ratio of the number of pixels to which the ith round of effective ground fitting value collection corresponds to the total number of pixels in the ground region is greater than a second threshold, accumulating all the ground fitting values in the ith round of effective ground fitting value collection;
[0071] S16 - when the accumulating result of all the ground fitting values in the ith round is smaller than a third threshold, the initial ground fitting formula to which the ith round corresponds is defined as the ground fitting formula; when the accumulating result of the ground fitting values to which the ith round corresponds is greater than the third threshold, let i = i+1, and return to step S12 when i does not reach a threshold number of rounds, otherwise execute step S17; and
[0072] S17 - defining the initial ground fitting formula, to which the minimum value of the accumulating results of all the ground fitting values in all rounds corresponds, as the ground fitting formula.
[0073] The method of creating a corresponding marker fitting formula based on the marker region includes:
[0074] S21 - making statistics on a data collection corresponding to the marker region in a one-to-one manner, the data collection including a plurality of pixels and depth values corresponding thereto;
[0075] S22 - randomly selecting n image points from the marker region to create a marker initial dataset, where n ≥ 3 and n is an integer;
[0076] S23 - creating an initial marker fitting formula based on the currently selected n pixels, traversing pixels not selected in the initial dataset, and sequentially substituting the pixels in the initial marker fitting formula to calculate marker fitting values of the corresponding pixels;
[0077] S24 - screening out marker fitting values that are smaller than a first threshold, and generating the ith round of effective marker fitting value collection, where the initial value of i is 1;
[0078] S25 - when a ratio of the number of pixels to which the ith round of effective marker fitting value collection corresponds to the total number of pixels in the marker region is greater than a second threshold, accumulating all the marker fitting values in the ith round of effective marker fitting value collection;
[0079] S26 - when the accumulating result of all the marker fitting values in the ith round is smaller than a third threshold, the initial marker fitting formula to which the ith round corresponds is defined as the marker fitting formula; when the accumulating result of the marker fitting values to which the ith round corresponds is greater than the third threshold, let i = i+1, and return to step S22 when i does not reach a threshold number of rounds, otherwise execute step S27; and
[0080] S27 - defining the initial marker fitting formula, to which the minimum value of the accumulating results of all the marker fitting values in all rounds corresponds, as the marker fitting formula.
[0081] Explanation is made below with the ground fitting formula as an example: a ground region is firstly frame-selected through an interactive mode set by a program, a data collection is screened out to contain only ground image points, three pixels are thereafter randomly selected to create a ground initial dataset, and an initial ground fitting formula is fitted by employing a plane formula, a_i*x + b_i*y + c_i*z + d_i = 0, where i represents the serial number of a depth camera; if only one depth camera is used in the full scene, then i takes the value 1, that is to say, the ground fitting formula is created only with respect to the first depth image captured by this one depth camera; if w depth cameras are used in the full scene, i is traversed from 1 to w, that is to say, corresponding ground fitting formulae should be created one by one with respect to the first depth images captured by the w depth cameras.
[0082] After the initial ground fitting formula has been created, the pixels not selected in the initial dataset are traversed (except for the three already selected pixels), the world coordinate values (x, y, z) to which each pixel corresponds are sequentially substituted in the initial ground fitting formula |a_i*x + b_i*y + c_i*z + d_i| to calculate the ground fitting value error_current to which each traversed pixel corresponds, and the ground fitting values that are smaller than a first threshold e are screened out to form an effective ground fitting value collection corresponding to this round of initial ground fitting formula; when a ratio of the number of corresponding pixels in this round of effective ground fitting value collection to the total number of pixels in the ground region is greater than a second threshold d, all the ground fitting values in this round of effective ground fitting value collection are accumulated to obtain a result error_sum, and when error_sum < error_best in this round, where error_best is a third threshold, the ground fitting formula is created on the basis of the values of a, b, c, d in this round of initial ground fitting formula, whereas when error_sum > error_best in this round, the above steps should be repeated to enter the next round, i.e., three image points are selected anew to create a ground initial dataset, an initial ground fitting formula is created and a result of accumulating all the ground fitting values in this round is obtained, until the initial ground fitting formula to which the minimum value of the result of accumulating all the ground fitting values in all rounds corresponds is defined as the ground fitting formula.
[0083] Through the above process it is made possible to effectively avoid interference from some abnormal points, and the ground fitting formula as calculated fits the ground better; in addition, since the values of a, b, c, d in the ground fitting formula are calculated by a random consistency algorithm, the resultant ground fitting formula can be used as the optimal model of the ground region in the first depth image, the interference of abnormal points is effectively filtered out, and the established ground equation is prevented from deviating from the ground.
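The iterative scheme of paragraphs [0066] to [0072] and [0081] to [0082] closely resembles a RANSAC-style plane fit. The sketch below is an assumed, non-normative implementation for one camera's frame-selected ground region: it repeatedly samples three points, fits a_i*x + b_i*y + c_i*z + d_i = 0, keeps fitting values below the first threshold e, checks the inlier ratio against the second threshold d, and retains the plane with the smallest accumulated error (error_best). The function name, default thresholds and round limit are illustrative assumptions, not values from the patent.

```python
import numpy as np

def fit_ground_plane(points, e=0.02, d=0.6, max_rounds=100):
    """points: (N, 3) world coordinates of the frame-selected ground region.
    Returns (a, b, c, d) of the best-fitting plane a*x + b*y + c*z + d = 0,
    or None if no round produced enough inliers."""
    best_plane, error_best = None, np.inf
    n = points.shape[0]
    for _ in range(max_rounds):
        # S12: randomly pick three points and fit an initial plane through them.
        idx = np.random.choice(n, size=3, replace=False)
        p1, p2, p3 = points[idx]
        normal = np.cross(p2 - p1, p3 - p1)
        if np.linalg.norm(normal) < 1e-9:          # degenerate (collinear) sample
            continue
        normal = normal / np.linalg.norm(normal)   # so |a*x + b*y + c*z + d| is a distance
        a, b, c = normal
        d_coef = -normal.dot(p1)

        # S13: fitting value (error_current) of every point not in the initial dataset.
        mask = np.ones(n, dtype=bool)
        mask[idx] = False
        fit_values = np.abs(points[mask] @ normal + d_coef)

        # S14: keep only fitting values below the first threshold e.
        effective = fit_values[fit_values < e]

        # S15/S16: require a sufficient inlier ratio, then keep the plane whose
        # accumulated error (error_sum) is the smallest seen so far.
        if effective.size / float(n) > d:
            error_sum = effective.sum()
            if error_sum < error_best:
                error_best, best_plane = error_sum, (a, b, c, d_coef)
    # S17: plane with the minimum accumulated error over all rounds.
    return best_plane
```

Normalizing the plane normal makes each fitting value an actual point-to-plane distance, which keeps it consistent with the distance equation used later in paragraph [0087].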
[0084] By the same token, the process of creating the marker fitting formula is logically consistent with the process of creating the ground fitting formula, so it is not redundantly described in this embodiment; it should be stressed, however, that since there is usually more than one marker region, the marker fitting formulae should correspond to the plural marker regions in a one-to-one manner.
[0085] In this embodiment, the method of merging the ground mask and the marker mask to form a background mask corresponding to each depth camera includes:
[0086] creating a ground equation on the basis of the ground fitting formula, and creating a marker equation on the basis of the marker fitting formula; traversing pixels in the first depth image, and respectively substituting the pixels in the ground equation and the marker equation to obtain ground distances and marker distances of the pixels;
screening out the pixels whose ground distances are smaller than a ground threshold to be filled as the ground mask, and screening out the pixels whose marker distances are smaller than a marker threshold to be filled as the marker masks; and merging the ground mask and all the marker masks to obtain a background mask to which the depth camera under the current scenario corresponds.
[0087] During specific implementation, a general distance equation, distance = |a_i*x + b_i*y + c_i*z + d_i| / sqrt(a_i^2 + b_i^2 + c_i^2), is employed to respectively calculate the ground equation and the marker equation: when the numerator |a_i*x + b_i*y + c_i*z + d_i| is a ground fitting formula and a, b, c in the denominator are values from the ground fitting formula, this equation represents a ground equation; when the numerator |a_i*x + b_i*y + c_i*z + d_i| is a marker fitting formula and a, b, c in the denominator are values from the marker fitting formula, this equation represents a marker equation. After the ground equation and the marker equation have been created to completion, the ground distances and marker distances of all the pixels in the first depth image are obtained by traversing the pixels and respectively substituting them in the ground equation and the marker equation; the pixels whose ground distances are smaller than a ground threshold are screened out to be filled as the ground mask, and the pixels whose marker distances are smaller than a marker threshold are screened out to be filled as the marker mask.
[0088] Exemplarily, the ground threshold and the marker threshold are both set as 10cm, that is to say, the region within 10cm of the ground is defined as a ground mask, the region within 10cm of the marker is defined as a marker mask, and finally the regions of the ground mask and the entire marker masks are defined as the background mask of the current scenario. Through the creation of the background mask, it is made possible to effectively filter out noises on the marker region(s) and the ground region, and to solve the problem concerning reduction in algorithm performance caused by noises generated by depth cameras shooting these regions. For instance, the marker is a shelf.
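Following the distance equation of paragraph [0087] and the 10 cm thresholds of paragraph [0088], the background mask can be filled by thresholding per-pixel point-to-plane distances. The sketch below assumes the first depth image has already been back-projected to world coordinates; the function and variable names are assumptions for illustration.

```python
import numpy as np

def plane_distance(points, plane):
    """Point-to-plane distance |a*x + b*y + c*z + d| / sqrt(a^2 + b^2 + c^2)."""
    a, b, c, d = plane
    return np.abs(points @ np.array([a, b, c]) + d) / np.sqrt(a * a + b * b + c * c)

def build_background_mask(world_points, ground_plane, marker_planes,
                          ground_threshold=0.10, marker_threshold=0.10):
    """world_points: (H, W, 3) world coordinates of every pixel of the first depth image.
    Returns a boolean (H, W) background mask (ground mask OR all marker masks)."""
    flat = world_points.reshape(-1, 3)
    # Ground mask: pixels within the ground threshold (10 cm) of the fitted ground plane.
    background = plane_distance(flat, ground_plane) < ground_threshold
    # Marker masks: one fitted plane per marker region (e.g. a shelf surface).
    for plane in marker_planes:
        background = background | (plane_distance(flat, plane) < marker_threshold)
    return background.reshape(world_points.shape[:2])
```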

[0089] In this embodiment, the method of updating the background mask on the basis of pixels in plural frames of second depth images continuously captured by each depth camera, and pixels in the background mask corresponding to each depth camera includes:
[0090] comparing depth values of pixels at various corresponding locations in the mth frame of second depth image and the (m+1)th frame of second depth image captured by the same depth camera, where an initial value of m is 1; recognizing pixels whose depth values are changed, updating the depth value of a pixel at a corresponding location in the (m+1)th frame of second depth image as the smaller value in the comparing result, letting m = m+1, and comparing again depth values of pixels at various corresponding locations in the mth frame of second depth image and the (m+1)th frame of second depth image, until pixels at various locations in the last frame of second depth image and their corresponding depth values are obtained; comparing the pixels at various locations in the last frame of second depth image and their corresponding depth values with pixels at various locations in the background mask and their corresponding depth values; and recognizing pixels whose depth values are changed, and updating the depth value of a pixel at a corresponding location in the background mask as the smaller value in the comparing result.
[0091] During specific implementation, internal parameters and external parameters of each depth camera are firstly calibrated to perform transformation of the image from two-dimensional coordinates to three-dimensional coordinates, so that relevant calculations are made with practical physical meanings. Subsequently, each depth camera is used to continuously capture 100 frames of second depth images, and background update is performed on the background mask with respect to the 100 frames of second depth images captured by each depth camera. The updating process is as follows: by comparing the depth values of pixels (row, col) at various identical locations in the 100 frames of second depth images, the minimum values of the corresponding depth values of pixels (row, col) at each identical location are screened out of the 100 frames of second depth images, so that the corresponding depth values of pixels (row, col) at various locations in the 100 frames of second depth images as output are all minimum values in the 100 frames of second depth images; the aim of such a setup is as follows: since the depth cameras employ the overhead shooting scheme, when a moving object (such as a passing pedestrian) appears in the second depth images, the depth values of pixels at corresponding locations become larger, and by taking the minimum values of the corresponding depth values of pixels at identical locations in the 100 frames of second depth images, it is made possible to effectively prevent the second depth images from being interfered with by passing objects that occasionally appear, and to avoid the appearance of pixels of passing objects in the background mask. Thereafter, the pixels at various locations in the 100 frames of second depth images and their corresponding depth values are compared with pixels at various locations in the background mask and their corresponding depth values, pixels whose depth values are changed are recognized, and the depth values of pixels at corresponding locations in the background mask are updated as the smaller values in the comparing result, so as to ensure the precision of the updated background mask.
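A compact way to realize the update of paragraphs [0089] to [0091] is a running per-pixel minimum over the second depth images followed by a per-pixel minimum against the existing background mask depths. The sketch below is an assumed NumPy formulation; treating zero depth readings as "no measurement" is an added assumption rather than something stated in the patent.

```python
import numpy as np

def update_background_depth(background_depth, second_depth_images):
    """background_depth: (H, W) depth values of the current background mask.
    second_depth_images: iterable of (H, W) depth frames captured continuously.
    Returns the updated background depth (per-pixel smaller value)."""
    running_min = None
    for frame in second_depth_images:
        frame = frame.astype(np.float32)
        # Assumed: a depth of 0 means "no measurement"; map it to +inf so it never wins.
        frame = np.where(frame > 0, frame, np.inf)
        running_min = frame if running_min is None else np.minimum(running_min, frame)
    # Compare the per-pixel minima of the second depth images against the
    # background mask depths and keep the smaller value at every location.
    updated = np.minimum(background_depth, running_min)
    return np.where(np.isfinite(updated), updated, background_depth)
```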
[0092] In this embodiment, the method of obtaining a full-scene top-view depth background picture and a full-scene top-view color background picture after coordinate-transforming and merging pixels in the background masks to which the various depth cameras correspond includes:
[0093] creating a full-scene top-view depth background blank template picture and a full-scene top-view color background blank template picture, wherein depth values of pixels at various locations in the full-scene top-view depth background blank template picture are zero, and color values of pixels at various locations in the full-scene top-view color background blank template picture are zero; merging and unifying pixels in the background masks to which the various depth cameras correspond to form a full-scene background mask, uniformly transforming the pixel coordinates to world coordinates, and then uniformly transforming the world coordinates to top-view coordinates;

sequentially traversing pixels in the full-scene background mask, comparing a depth value of each pixel with depth values of pixels at corresponding locations in the full-scene top-view depth background blank template picture, and replacing pixels at corresponding locations in the full-scene top-view depth background blank template picture with large-value pixels in the full-scene background mask, to obtain a full-scene top-view depth background picture; and on the basis of pixels to which replacement occurs in the full-scene top-view depth background picture, replacing pixels at corresponding locations in the full-scene top-view color background blank template picture with their pixel color values, to obtain a full-scene top-view color background picture.
[0094] During specific implementation, the depth values of pixels at various locations in the created full-scene top-view depth background blank template picture are zero, namely back_depth(row, col) = 0, and the color values of pixels at various locations in the created full-scene top-view color background blank template picture are zero, namely back_color(row, col) = [0, 0, 0]; thereafter, the pixels in the background masks to which the various depth cameras correspond are merged, that is to say, the pixels in the background masks to which the various depth cameras correspond are uniformly expressed in the same and single pixel coordinate system to form a full-scene background mask, the various pixels in the full-scene background mask are then uniformly transformed from pixel coordinates to world coordinates, and subsequently uniformly transformed from the world coordinates to top-view coordinates under the current monitored scenario; the coordinates transforming process is well known to persons skilled in the art, and is not redundantly described in this embodiment. Consequently, a pixel comparison formula current_depth(row, col) > back_depth(row, col) is employed to compare the depth value of each pixel [current_depth(row, col)] in the full-scene background mask with the depth value of a pixel [back_depth(row, col)] at the corresponding location in the full-scene top-view depth background blank template picture, a full-scene top-view depth background picture formula back_depth(row, col) = current_depth(row, col) is employed to replace pixels at corresponding locations in the full-scene top-view depth background blank template picture with large-value pixels in the full-scene background mask, to obtain a full-scene top-view depth background picture, and a full-scene top-view color background picture formula back_color(row, col) = current_color(row, col) is employed to replace pixels at corresponding locations in the full-scene top-view color background blank template picture with color values of pixels to which replacement occurs in the full-scene top-view depth background picture, to obtain a full-scene top-view color background picture.
[0095] Understandably, current_depth(row, col) represents the depth values of pixels in the full-scene background mask, back_depth(row, col) represents the depth values of pixels in the full-scene top-view depth background blank template picture, and the formula back_depth(row, col) = current_depth(row, col) represents assigning the depth value of a pixel at a certain coordinate location in the full-scene background mask to the pixel at the corresponding location in the full-scene top-view depth background blank template picture, namely replacing the pixel at the corresponding location in the full-scene top-view depth background blank template picture; by the same token, current_color(row, col) represents the color values of pixels in the full-scene background mask, back_color(row, col) represents the color values of pixels in the full-scene top-view color background blank template picture, and the formula back_color(row, col) = current_color(row, col) represents assigning the color value of a pixel at a certain coordinate location in the full-scene background mask to the pixel at the corresponding location in the full-scene top-view color background blank template picture. A full-scene top-view depth background picture and a full-scene top-view color background picture are formed once all the pixels have been traversed.
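The merging rule of paragraphs [0093] to [0095] keeps the larger depth value at each top-view coordinate and copies the color of whichever pixel won. The sketch below is a minimal assumed formulation: the pixels are taken to be already transformed to integer top-view coordinates, and the template size and the (row, col, depth, color) tuple layout are illustrative assumptions.

```python
import numpy as np

def merge_to_full_scene_background(pixels, template_shape=(1000, 1000)):
    """pixels: iterable of (row, col, depth, (b, g, r)) tuples in top-view coordinates,
    gathered from all cameras' background masks.
    Returns (back_depth, back_color), the full-scene top-view background pictures."""
    back_depth = np.zeros(template_shape, dtype=np.float32)       # blank depth template, all zeros
    back_color = np.zeros(template_shape + (3,), dtype=np.uint8)  # blank color template, all zeros

    for row, col, depth, color in pixels:
        # Replace the template pixel only when the incoming depth is larger,
        # i.e. current_depth(row, col) > back_depth(row, col).
        if depth > back_depth[row, col]:
            back_depth[row, col] = depth
            back_color[row, col] = color  # the color follows the replaced depth pixel
    return back_depth, back_color
```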
[0096] In this embodiment, the method of splitting the full-scene top-view depth background picture into separate top-view depth background pictures corresponding to each depth camera, and splitting the full-scene top-view color background picture into separate top-view color background pictures corresponding to each depth camera includes:
[0097] on the basis of top-view coordinates of pixels of the background mask to which each depth camera corresponds, splitting the full-scene top-view depth background picture into separate top-view depth background pictures corresponding to each depth camera, and splitting the full-scene top-view color background picture into separate top-view color background pictures corresponding to each depth camera.
[0098] During specific implementation, sensor_depth[k] represents the separate top-view depth background pictures to which the kth depth camera corresponds, back_depth represents the full-scene top-view depth background picture, and the formula sensor_depth[k](row, col) = back_depth(row, col) is employed to split the full-scene top-view depth background picture into the separate top-view depth background pictures to which the kth depth camera corresponds, in which back_depth(row, col) represents the depth value of a pixel at a certain coordinate location in the full-scene top-view depth background picture, sensor_depth[k](row, col) represents the depth value of a pixel at a certain coordinate location in the separate top-view depth background pictures to which the kth depth camera corresponds, and the formula sensor_depth[k](row, col) = back_depth(row, col) represents assigning the depth value of a pixel at a certain coordinate location in the full-scene top-view depth background picture to the pixel at the corresponding location in the separate top-view depth background pictures to which the kth depth camera corresponds; by the same token, sensor_color[k] represents the separate top-view color background pictures to which the kth depth camera corresponds, back_color represents the full-scene top-view color background picture, and the formula sensor_color[k](row, col) = back_color(row, col) is employed to split the full-scene top-view color background picture into the separate top-view color background pictures to which the kth depth camera corresponds.
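Paragraph [0098] assigns back to each camera the top-view pixels that its own background mask covers. The sketch below assumes each camera is described by the list of top-view (row, col) coordinates its mask occupies; the names sensor_depth and sensor_color mirror the notation above, but the dictionary-based data layout is an assumption.

```python
import numpy as np

def split_per_camera(back_depth, back_color, camera_topview_coords):
    """camera_topview_coords: dict mapping camera id k to an (N_k, 2) integer array of
    top-view (row, col) coordinates covered by that camera's background mask.
    Returns per-camera separate top-view depth/color background pictures."""
    sensor_depth, sensor_color = {}, {}
    for k, coords in camera_topview_coords.items():
        depth_k = np.zeros_like(back_depth)
        color_k = np.zeros_like(back_color)
        rows, cols = coords[:, 0], coords[:, 1]
        # sensor_depth[k](row, col) = back_depth(row, col), and likewise for color.
        depth_k[rows, cols] = back_depth[rows, cols]
        color_k[rows, cols] = back_color[rows, cols]
        sensor_depth[k], sensor_color[k] = depth_k, color_k
    return sensor_depth, sensor_color
```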
[0099] In this embodiment, the method of updating pixels in a foreground region into the top-view depth background picture and the top-view color background picture of the corresponding depth camera by recognizing the foreground region that contains human body pixels in third depth images obtained in real time by the various depth cameras includes:
[0100] comparing depth values of pixels in the third depth images obtained in real time by the depth cameras with depth values of pixels of the corresponding separate top-view depth background pictures; employing a frame difference method to recognize pixels whose depth values are small values in the third depth images, and summarizing to obtain a foreground region that contains human body pixels; correspondingly matching and associating pixels in the foreground region with pixels of the separate top-view depth background pictures in a one-to-one manner, and replacing depth values of the pixels in the separate top-view depth background pictures with depth values of the pixels in the corresponding foreground region; and recognizing pixels to which replacement occurs in the separate top-view depth background pictures, and replacing corresponding pixels in the separate top-view color background pictures with color values of pixels in the foreground region. As can be seen, through such a frame difference method, it is made possible to effectively filter out noises from the third depth images obtained in real time, and to enhance precision in foreground region recognition.
[0101] During specific implementation, in order to reduce the number of pixels and enhance computing speed, the voxel filtering method can be employed to filter the pixels. Exemplarily, the voxel size is set as vox_size = (0.1, 0.1, 0.1), and the sparse outlier removing method is employed to filter out partial pixels based on the distances between adjacent pixels and the multiples of standard deviations, so as to effectively reduce interference from outlying noises.
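A possible realization of this filtering step, assuming the foreground pixels have been lifted to 3D points and that the Open3D library is used; the neighbour count and standard-deviation ratio are placeholder parameters, not values from the specification.

```python
import numpy as np
import open3d as o3d

def filter_points(xyz, vox_size=0.1, nb_neighbors=20, std_ratio=2.0):
    """Downsample the foreground point cloud and drop sparse outliers.

    xyz: (N, 3) array of foreground points in world coordinates.
    vox_size corresponds to the (0.1, 0.1, 0.1) voxel mentioned in the text;
    nb_neighbors / std_ratio are assumed outlier-removal parameters.
    """
    pcd = o3d.geometry.PointCloud()
    pcd.points = o3d.utility.Vector3dVector(xyz)
    # Voxel filtering: keep one representative point per 0.1 m voxel.
    pcd = pcd.voxel_down_sample(voxel_size=vox_size)
    # Sparse outlier removal: drop points whose mean distance to their neighbours
    # exceeds a multiple of the standard deviation.
    pcd, _ = pcd.remove_statistical_outlier(nb_neighbors=nb_neighbors,
                                            std_ratio=std_ratio)
    return np.asarray(pcd.points)
```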
[0102] In this embodiment, the method of merging the top-view depth pictures of the various depth cameras to form a full-scene top-view depth picture, and merging the top-view color pictures of the various depth cameras to form a full-scene top-view color picture includes:
[0103] traversing pixels in the corresponding top-view depth picture of each depth camera, and replacing depth values of pixels at corresponding locations in the full-scene top-view depth background picture, to obtain a full-scene top-view depth picture; and recognizing pixels to which replacement occurs in the full-scene top-view depth picture, and replacing color values of pixels at corresponding locations in the full-scene top-view color background picture, to obtain a full-scene top-view color picture.
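One way this merging might look in code, assuming numpy arrays and using zero depth to mark pixels not covered by a camera; the function and variable names are illustrative only.

```python
import numpy as np

def merge_full_scene(top_depths, top_colors, back_depth, back_color):
    """Merge per-camera top-view pictures onto the full-scene background pictures.

    top_depths / top_colors: lists of per-camera top-view depth / color pictures
    back_depth / back_color: full-scene top-view background pictures used as base
    """
    full_depth = back_depth.copy()
    full_color = back_color.copy()
    for depth_k, color_k in zip(top_depths, top_colors):
        # Replace depth values at covered locations where this camera's picture
        # differs from the current full-scene depth picture.
        covered = depth_k > 0
        changed = covered & (depth_k != full_depth)
        full_depth[changed] = depth_k[changed]
        # Wherever a depth value was replaced, carry over the color value too.
        full_color[changed] = color_k[changed]
    return full_depth, full_color
```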
[0104] In this embodiment, the method of recognizing a pedestrian detecting result by comparing pixels in the full-scene top-view depth picture and pixels in the full-scene top-view depth background picture, and comparing pixels in the full-scene top-view color picture and pixels in the full-scene top-view color background picture includes:
[0105] comparing pixels whose depth values are changed in the full-scene top-view depth picture and the full-scene top-view depth background picture, and on the basis of a dense region area of pixels and depth values of the various pixels, recognizing a head volume and/or a body volume; and recognizing a pedestrian detecting result on the basis of sizes and/or a size of the head volume and/or the body volume.
[0106] During specific implementation, considering that there might be unintentional captures in the detecting result, it is possible to filter out such captures according to actual physical features. By transforming the full-scene top-view depth picture to actual world coordinates, the physical volume of the human body, and likewise the physical volume of the human head, is calculated in the foreground region in conjunction with a human body detection frame; for instance, boundary lengths and widths of the human body and the human head are calculated on the basis of pixel coordinates, and the physical volume of the human body and the physical volume of the human head are then obtained in combination with depth values.
[0107] If V_body_max > V_body > V_body_min is satisfied, the human body volume requirement is met;
[0108] if V_head_max > V_head > V_head_min is satisfied, the human head volume requirement is met,
[0109] where V_body represents the physical volume of the human body as detected, V_head represents the physical volume of the human head as detected, V_body_max and V_body_min represent the preset upper limit and lower limit of recognition of the physical volume of the human body, and V_head_max and V_head_min represent the preset upper limit and lower limit of recognition of the physical volume of the human head. If only the human body but no human head is detected in the full-scene top-view depth picture, a human-head searching mode is started to automatically search for the human head frame in the full-scene top-view depth picture through an algorithm. Through the human head frame searching function, it is made possible to effectively call back the human head frame missing in the full-scene top-view depth picture, and to thus enhance algorithm stability.
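A hedged sketch of how the volume verification could be computed from the top-view pictures; the pixel_area calibration constant and the numeric volume limits are assumptions, since the specification only states that preset limits are used.

```python
import numpy as np

def estimate_volume(depth_roi, back_depth_roi, pixel_area):
    """Approximate the physical volume of a foreground region in a detection frame.

    depth_roi / back_depth_roi: depth values inside the detection frame, taken from
    the full-scene top-view depth picture and the full-scene top-view depth
    background picture respectively (same units, e.g. metres).
    pixel_area: physical ground area covered by one top-view pixel (m^2), an
    assumed calibration constant.
    """
    # In the top-view picture, a smaller depth than the background means the pixel
    # is raised above the ground by (background depth - current depth).
    height = np.clip(back_depth_roi - depth_roi, 0.0, None)
    return float(np.sum(height) * pixel_area)

def volumes_acceptable(v_body, v_head,
                       v_body_min=0.02, v_body_max=0.25,
                       v_head_min=0.002, v_head_max=0.010):
    """Check V_body_max > V_body > V_body_min and V_head_max > V_head > V_head_min.

    The numeric limits here are placeholders only."""
    return (v_body_max > v_body > v_body_min,
            v_head_max > v_head > v_head_min)
```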
[0110] During specific implementation, boundary pixels of the foreground region in the full-scene top-view depth picture are recognized through the frame difference method, i.e., the foreground region of the human body ROI is represented by bird_depth_map_mask_roi, and the formula bird_depth_map_mask_roi = bird_depth_map_mask[row_min: row_max, col_min: col_max] is employed to recognize the boundary pixels of the foreground region, where row_min and row_max represent the lower limit and the upper limit of the pixels on the x axis, and col_min and col_max represent the lower limit and the upper limit of the pixels on the y axis. Moreover, in order to accelerate computation, it is possible to employ the mode of calculating integral graph accumulation, namely to accumulatively add the depth values of plural pixels, until the location of a human head frame is demarcated on reaching a threshold range. Thereafter, head points are searched in the human head frame; in other words, a circle of head points is moved within the human head frame, and the head-point region is searched in the human head frame on the basis of the ratio of the foreground pixels in the head-point circle to all the pixels in the circle. Through the aforementioned head-point searching mechanism, it is made possible to effectively filter out the interference from noises, avoid instability of head points caused by noises, and prevent noises from adversely affecting the body height and subsequent tracking.
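The ROI slicing, integral-graph accumulation and head-circle ratio test might be sketched as follows; the circular head template and all function names are illustrative assumptions rather than the specification's own implementation.

```python
import numpy as np

def foreground_roi(bird_depth_map_mask, row_min, row_max, col_min, col_max):
    # bird_depth_map_mask_roi = bird_depth_map_mask[row_min:row_max, col_min:col_max]
    return bird_depth_map_mask[row_min:row_max, col_min:col_max]

def integral_accumulation(depth_roi):
    """Integral-graph accumulation: cumulative sums of depth values along both
    axes, so the depth sum of any sub-rectangle can be read off in O(1)."""
    return depth_roi.astype(np.float64).cumsum(axis=0).cumsum(axis=1)

def head_circle_ratio(foreground_mask, center, radius):
    """Ratio of foreground pixels inside a candidate head circle to all pixels in
    the circle; center and radius describe an assumed circular head template."""
    rows, cols = np.ogrid[:foreground_mask.shape[0], :foreground_mask.shape[1]]
    circle = (rows - center[0]) ** 2 + (cols - center[1]) ** 2 <= radius ** 2
    return float(foreground_mask[circle].mean()) if circle.any() else 0.0
```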
[0111] Subsequently, it is further possible, on the basis of an average value of the depth values of the pixels in the head-point region, to calculate the human body height by employing the formula distance = |a·x_t + b·y_t + c·z_t + d| / √(a² + b² + c²), together with the 2D or 3D head-point coordinates, where a, b, c and d are the plane coefficients from the ground fitting formula and (x_t, y_t, z_t) are the head-point coordinates.
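A small sketch of this height computation, assuming (a, b, c, d) are the coefficients of the fitted ground plane a·x + b·y + c·z + d = 0 and (x_t, y_t, z_t) is the averaged head point in world coordinates.

```python
import numpy as np

def height_from_ground_plane(head_point, plane):
    """Distance from the averaged 3D head point (x_t, y_t, z_t) to the plane
    a*x + b*y + c*z + d = 0; with the ground plane this distance serves as the
    human body height."""
    a, b, c, d = plane
    x_t, y_t, z_t = head_point
    return abs(a * x_t + b * y_t + c * z_t + d) / np.sqrt(a * a + b * b + c * c)
```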

[0112] In summary, this embodiment exhibits the following creative aspects:
[0113] Through the distributed layout of multiple depth cameras, it is made possible to adapt to complicated monitored scenarios where a great deal of shielding is present; by allowing the viewing angles of the depth cameras to slightly overlap, it is possible to make maximal use of the coverage area of the viewing angles of the cameras, and to obtain a full-scene top-view depth picture of the entire monitored scenario in conjunction with the merging rules.
[0114] The use of RGBD depth cameras increases information dimensions: a full-scene top-view depth picture of particular viewing angles is obtained by merging depth information, and a full-scene top-view color picture of particular viewing angles is obtained by merging color information. Pedestrian detection can be effectively performed through the full-scene top-view color picture, secondary verification can be performed on the detecting result in conjunction with the depth information, and such information as is relevant to height can be obtained.
[0115] The use of the merging mode whereby foreground and background are respectively merged can reduce the merging of irrelevant background, effectively shorten the overall merging time, and hence enhance algorithm performance.
[0116] The use of refined and simplified algorithmic logic, for instance the human head frame searching function, can avoid the circumstance in which the pedestrian cannot be subsequently tracked due to absence of the head frame, and thereby enhance the robustness of the algorithms.
[0117] The solution of the above embodiment carries out foreground detection separately for each depth camera, and a full-scene top-view picture of the entire scenario is finally integrated via a merging module, whereby waste of computational resources can be effectively reduced and computing speed is enhanced.
[0118] Embodiment 2
[0119] This embodiment provides a pedestrian detecting device that comprises:
[0120] a mask creating unit, for creating a background mask corresponding to each depth camera according to a first depth image captured by each depth camera, wherein the background mask includes a ground mask and a marker mask;
[0121] a mask updating unit, for respectively updating background masks to which various depth cameras correspond on the basis of pixels in plural frames of second depth images continuously captured by each depth camera, and pixels in the background mask corresponding to each depth camera;
[0122] a mask merging unit, for obtaining a full-scene top-view depth background picture and a full-scene top-view color background picture after coordinate-transforming and merging pixels in the background masks to which the various depth cameras correspond;
[0123] a background splitting unit, for splitting the full-scene top-view depth background picture into separate top-view depth background pictures corresponding to each depth camera, and splitting the full-scene top-view color background picture into separate top-view color background pictures corresponding to each depth camera;
[0124] a foreground recognizing unit, for updating pixels in a foreground region into the top-view depth background picture and the top-view color background picture of the corresponding depth camera by recognizing the foreground region that contains human body pixels in third depth images obtained in real time by the various depth cameras, so as to update the top-view depth picture and the top-view color picture of each depth camera;
[0125] a full-scene merging unit, for merging the top-view depth pictures of the various depth cameras to form a full-scene top-view depth picture, and merging the top-view color pictures of the various depth cameras to form a full-scene top-view color picture; and
[0126] a pedestrian detecting unit, for recognizing a pedestrian detecting result by comparing pixels in the full-scene top-view depth picture and pixels in the full-scene top-view depth background picture, and comparing pixels in the full-scene top-view color picture and pixels in the full-scene top-view color background picture.

[0127] In comparison with prior-art technology, the advantageous effects achievable by the pedestrian detecting device provided by this embodiment of the present invention are identical with the advantageous effects achieved by the pedestrian detecting method provided by the aforementioned Embodiment 1, and are therefore not repeated here.
[0128] Embodiment 3
[0129] This embodiment provides a computer-readable storage medium storing thereon a computer program that, when executed by a processor, carries out the steps of the aforementioned pedestrian detecting method.
[0130] In comparison with prior-art technology, the advantageous effects achievable by the computer-readable storage medium provided by this embodiment are identical with the advantageous effects achieved by the pedestrian detecting method provided by the aforementioned technical solution, and are therefore not repeated here.
[0131] As comprehensible to persons ordinarily skilled in the art, the entire or partial steps realizing the method of the present invention can be completed through a program that instructs relevant hardware; the program can be stored in a computer-readable storage medium and, when executed, performs the various steps of the method of the above embodiments; and the storage medium can be a ROM/RAM, a magnetic disk, an optical disk, a memory card, etc.
[0132] What is described above is merely directed to specific embodiments of the present invention, but the protection scope of the present invention is not restricted thereby. Any variation or replacement easily conceivable to persons skilled in the art within the technical range disclosed by the present invention shall be covered within the protection scope of the present invention. Accordingly, the protection scope of the present invention shall be based on the Claims.

Claims (10)

What is claimed is:
1. A pedestrian detecting method, characterized in comprising:
creating a background mask corresponding to each depth camera according to a first depth image captured by each depth camera, wherein the background mask includes a ground mask and a marker mask;
respectively updating background masks to which various depth cameras correspond on the basis of pixels in plural frames of second depth images continuously captured by each depth camera, and pixels in the background mask corresponding to each depth camera;
obtaining a full-scene top-view depth background picture and a full-scene top-view color background picture after coordinate-transforming and merging pixels in the background masks to which the various depth cameras correspond;
splitting the full-scene top-view depth background picture into separate top-view depth background pictures corresponding to each depth camera, and splitting the full-scene top-view color background picture into separate top-view color background pictures corresponding to each depth camera;
updating pixels in a foreground region into the top-view depth background picture and the top-view color background picture of the corresponding depth camera by recognizing the foreground region that contains human body pixels in third depth images obtained in real time by the various depth cameras, so as to update the top-view depth picture and the top-view color picture of each depth camera;
merging the top-view depth pictures of the various depth cameras to form a full-scene top-view depth picture, and merging the top-view color pictures of the various depth cameras to form a full-scene top-view color picture; and recognizing a pedestrian detecting result by comparing pixels in the full-scene top-view depth picture and pixels in the full-scene top-view depth background picture, and comparing pixels in the full-scene top-view color picture and pixels in the full-scene top-view color background picture.
2. The method according to Claim 1, characterized in that the step of creating a background mask corresponding to each depth camera according to a first depth image captured by each depth camera includes:
frame-selecting a ground region from the first depth image captured by each depth camera to create a ground fitting formula, and frame-selecting at least one marker region to create a marker fitting formula corresponding to the marker region in a one-to-one manner;
creating the ground mask corresponding to each depth camera according to the ground fitting formula, and creating the marker mask corresponding to each depth camera according to the marker fitting formula; and merging the ground mask and the marker mask to form the background mask corresponding to each depth camera.
3. The method according to Claim 1, characterized in that the step of updating background masks on the basis of pixels in plural frames of second depth images continuously captured by each depth camera, and pixels in the background mask corresponding to each depth camera includes:
comparing depth values of pixels at various corresponding locations in the mth frame of second depth image and the (m+1)th frame of second depth image captured by the same depth camera, where an initial value of m is 1;
recognizing pixels whose depth values are changed, updating the depth value of a pixel at a corresponding location in the (m+1)th frame of second depth image as a small value in the comparing result, let m = m+1, and comparing again depth values of pixels at various corresponding locations in the mth frame of second depth image and the (m+1)th frame of second depth image, until pixels at various locations in the last frame of second depth image and their corresponding depth values are obtained;
comparing the pixels at various locations in the last frame of second depth image and their corresponding depth values with pixels at various locations in the background mask and their corresponding depth values; and recognizing pixels whose depth values are changed, and updating the depth value of a pixel at a corresponding location in the background mask as a small value in the comparing result.
4. The method according to Claim 1, characterized in that the step of obtaining a full-scene top-view depth background picture and a full-scene top-view color background picture after coordinate-transforming and merging pixels in the background masks to which the various depth cameras correspond includes:
creating a full-scene top-view depth background blank template picture and a full-scene top-view color background blank template picture, wherein depth values of pixels at various locations in the full-scene top-view depth background blank template picture are zero, and color values of pixels at various locations in the full-scene top-view color background blank template picture are zero;
merging and unifying pixels in the background masks to which the various depth cameras correspond to form a full-scene background mask, uniformly transforming the pixel coordinates to world coordinates, and then uniformly transforming the world coordinates to top-view coordinates;
sequentially traversing pixels in the full-scene background mask, comparing a depth value of each pixel with depth values of pixels at corresponding locations in the full-scene top-view depth background blank template picture, and replacing pixels at corresponding locations in the full-scene top-view depth background blank template picture with large-value pixels in the full-scene background mask, to obtain a full-scene top-view depth background picture; and on the basis of pixels to which replacement occurs in the full-scene top-view depth background mask, replacing their pixel color values to pixels at corresponding locations in the full-scene top-view color background blank template picture, to obtain a full-scene top-view color background picture.
5. The method according to Claim 4, characterized in that the step of splitting the full-scene top-view depth background picture into separate top-view depth background pictures corresponding to each depth camera, and splitting the full-scene top-view color background picture into separate top-view color background pictures corresponding to each depth camera includes:
on the basis of top-view coordinates of pixels of the background mask to which each depth camera corresponds, splitting the full-scene top-view depth background picture into separate top-view depth background pictures corresponding to each depth camera, and splitting the full-scene top-view color background picture into separate top-view color background pictures corresponding to each depth camera.
6. The method according to Claim 5, characterized in that the step of updating pixels in a foreground region into the top-view depth background picture and the top-view color background picture of the corresponding depth camera by recognizing the foreground region that contains human body pixels in third depth images obtained in real time by the various depth cameras includes:
comparing depth values of pixels in the third depth images obtained in real time by the depth cameras with depth values of pixels of the corresponding separate top-view depth background pictures;
employing a frame difference method to recognize pixels whose depth values are small values in the third depth images, and summarizing to obtain a foreground region that contains human body pixels;
correspondingly matching and associating pixels in the foreground region with pixels of the separate top-view depth background pictures in a one-to-one manner, and replacing depth values of the pixels in the separate top-view depth background pictures with depth values of the pixels in the corresponding foreground region; and recognizing pixels to which replacement occurs in the separate top-view depth background pictures, and replacing color values of pixels in the foreground region to corresponding pixels in the separate top-view color background pictures.
7. The method according to Claim 6, characterized in that the step of merging the top-view depth pictures of the various depth cameras to form a full-scene top-view depth picture, and merging the top-view color pictures of the various depth cameras to form a full-scene top-view color picture includes:
traversing pixels in the corresponding top-view depth picture of each depth camera, and replacing depth values of pixels at corresponding locations in the full-scene top-view depth background picture, to obtain a full-scene top-view depth picture; and recognizing pixels to which replacement occurs in the full-scene top-view depth picture, and replacing color values of pixels at corresponding locations in the full-scene top-view color background picture, to obtain a full-scene top-view color picture.
8. The method according to Claim 7, characterized in that the step of recognizing a pedestrian detecting result by comparing pixels in the full-scene top-view depth picture and pixels in the full-scene top-view depth background picture, and comparing pixels in the full-scene top-view color picture and pixels in the full-scene top-view color background picture includes:
comparing pixels whose depth values are changed in the full-scene top-view depth picture and the full-scene top-view depth background picture, and on the basis of a dense region area of pixels and depth values of the various pixels, recognizing a head volume and/or a body volume;
and recognizing a pedestrian detecting result on the basis of sizes and/or a size of the head volume and/or the body volume.
9. A pedestrian detecting device, characterized in comprising:
a mask creating unit, for creating a background mask corresponding to each depth camera according to a first depth image captured by each depth camera, wherein the background mask includes a ground mask and a marker mask;
a mask updating unit, for respectively updating background masks to which various depth cameras correspond on the basis of pixels in plural frames of second depth images continuously captured by each depth camera, and pixels in the background mask corresponding to each depth camera;
a mask merging unit, for obtaining a full-scene top-view depth background picture and a full-scene top-view color background picture after coordinate-transforming and merging pixels in the background masks to which the various depth cameras correspond;
a background splitting unit, for splitting the full-scene top-view depth background picture into separate top-view depth background pictures corresponding to each depth camera, and splitting the full-scene top-view color background picture into separate top-view color background pictures corresponding to each depth camera;
a foreground recognizing unit, for updating pixels in a foreground region into the top-view depth background picture and the top-view color background picture of the corresponding depth camera by recognizing the foreground region that contains human body pixels in third depth images obtained in real time by the various depth cameras, so as to update the top-view depth picture and the top-view color picture of each depth camera;
a full-scene merging unit, for merging the top-view depth pictures of the various depth cameras to form a full-scene top-view depth picture, and merging the top-view color pictures of the various depth cameras to form a full-scene top-view color picture; and a pedestrian detecting unit, for recognizing a pedestrian detecting result by comparing pixels in the full-scene top-view depth picture and pixels in the full-scene top-view depth background picture, and comparing pixels in the full-scene top-view color picture and pixels in the full-scene top-view color background picture.
10. A computer-readable storage medium, storing a computer program thereon, characterized in that the method steps according to any of Claims 1 to 8 are realized when the computer program is executed by a processor.
CA3150597A 2021-03-02 2022-03-01 Pedestrian detecting method and device Pending CA3150597A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110231224.3A CN113065397B (en) 2021-03-02 2021-03-02 Pedestrian detection method and device
CN202110231224.3 2021-03-02

Publications (1)

Publication Number Publication Date
CA3150597A1 true CA3150597A1 (en) 2022-09-02

Family

ID=76559518

Family Applications (1)

Application Number Title Priority Date Filing Date
CA3150597A Pending CA3150597A1 (en) 2021-03-02 2022-03-01 Pedestrian detecting method and device

Country Status (2)

Country Link
CN (1) CN113065397B (en)
CA (1) CA3150597A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116758136A (en) * 2023-08-21 2023-09-15 杭州蓝芯科技有限公司 Real-time online identification method, system, equipment and medium for cargo volume
CN116993886A (en) * 2023-09-26 2023-11-03 腾讯科技(深圳)有限公司 Method and related device for generating regional contour map in rendering

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114047753B (en) * 2021-11-03 2023-02-03 哈尔滨鹏路智能科技有限公司 Obstacle recognition and obstacle avoidance method of sweeping robot based on deep vision

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103971380B (en) * 2014-05-05 2016-09-28 中国民航大学 Pedestrian based on RGB-D trails detection method
CN106096512B (en) * 2016-05-31 2020-08-25 上海美迪索科电子科技有限公司 Detection device and method for recognizing vehicle or pedestrian by using depth camera
CN110232717A (en) * 2019-06-10 2019-09-13 北京壹氢科技有限公司 A kind of target identity recognition methods suitable for multipair multi-targets recognition
CN111652136B (en) * 2020-06-03 2022-11-22 苏宁云计算有限公司 Pedestrian detection method and device based on depth image

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116758136A (en) * 2023-08-21 2023-09-15 杭州蓝芯科技有限公司 Real-time online identification method, system, equipment and medium for cargo volume
CN116758136B (en) * 2023-08-21 2023-11-10 杭州蓝芯科技有限公司 Real-time online identification method, system, equipment and medium for cargo volume
CN116993886A (en) * 2023-09-26 2023-11-03 腾讯科技(深圳)有限公司 Method and related device for generating regional contour map in rendering
CN116993886B (en) * 2023-09-26 2024-01-09 腾讯科技(深圳)有限公司 Method and related device for generating regional contour map in rendering

Also Published As

Publication number Publication date
CN113065397B (en) 2022-12-23
CN113065397A (en) 2021-07-02

Similar Documents

Publication Publication Date Title
US11703951B1 (en) Gesture recognition systems
CA3150597A1 (en) Pedestrian detecting method and device
JP5487298B2 (en) 3D image generation
CN117893680A (en) Room layout estimation method and technique
US20200184651A1 (en) Image processing system, image processing method, and program
US9478033B1 (en) Particle-based tracking of objects within images
CN111460926A (en) Video pedestrian detection method fusing multi-target tracking clues
JP7389116B2 (en) Deep neural network pose estimation system
CN111652136B (en) Pedestrian detection method and device based on depth image
CN108958473A (en) Eyeball tracking method, electronic device and non-transient computer-readable recording medium
US20150243031A1 (en) Method and device for determining at least one object feature of an object comprised in an image
KR20130000374A (en) Tracking method
KR101362631B1 (en) Head recognition method
US20160093101A1 (en) Method And System For Generating A Three-Dimensional Model
US20190095694A1 (en) Apparatus and method for performing 3d estimation based on locally determined 3d information hypotheses
CN113674400A (en) Spectrum three-dimensional reconstruction method and system based on repositioning technology and storage medium
CN111275734B (en) Object identification and tracking system and method thereof
CN112465911A (en) Image processing method and device
Gallego et al. Region based foreground segmentation combining color and depth sensors via logarithmic opinion pool decision
CN112257617B (en) Multi-modal target recognition method and system
Décombas et al. Spatio-temporal saliency based on rare model
CN115729250A (en) Flight control method, device and equipment of unmanned aerial vehicle and storage medium
Chen et al. Accurate human body reconstruction for volumetric video
CN114973344A (en) Face detection method, face detection device, terminal equipment and computer readable storage medium
CN111932629A (en) Target positioning method and system based on deep neural network

Legal Events

Date Code Title Description
EEER Examination request

Effective date: 20220916
