WO2023147892A1

WO2023147892A1 - Long-term accurate crowd estimation in smart cities

Info

Publication number: WO2023147892A1
Application number: PCT/EP2022/071653
Authority: WO
Inventors: Gurkan Solmaz; Flavio CIRILLO
Original assignee: NEC Laboratories Europe GmbH
Priority date: 2022-02-04
Filing date: 2022-08-02
Publication date: 2023-08-10

Abstract

The invention provides systems and computer-implemented methods for crowd estimation in an environment. The method comprises collecting, via accurate sensing capabilities of a mobile intelligent agent (202), environmental data and processing the collected environmental data for vehicle, people, and/or object recognition; using wireless sensors (252) deployed in the environment to collect wireless signals from the environment; training, during a training phase, a machine learning model to learn correlations between the collected environmental data and the collected wireless signals; and performing, during an operational phase, crowd estimation based on collected wireless signals only and by calibrating the collected wireless signals using the trained machine learning model.

Description

LONG-TERM ACCURATE CROWD ESTIMATION IN SMART CITIES

The present invention relates to apparatus and computer-implemented methods for crowd estimation in an environment, such as a city or any other defined area of interest.

The project leading to this application has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 871249.

The crowd estimation problem has been one of the key problems in domains such as smart cities and smart buildings. Various commercial applications already deploy crowd estimation systems (for reference, e.g., see https://www.stuff.co.nz/dominion-post/news/107217782/big-brother-has-left-the-capital--for-the-time-being). The applications of crowd estimation include urban city planning, epidemic monitoring, tourism services, transportation, smart parking (for reference, e.g., see Anonymous. “Smart City Crowd Parking”, published on CableLabs), and others.

The current state-of-the-art crowd estimation models mostly consider either fully camera-based solutions or purely Wi-Fi-based solutions. Recently, a hybrid solution has been proposed that is based on correlation of Wi-Fi signals with image processing and that tries to learn correlations between the Wi-Fi signals and images using stereoscopic cameras (for reference, see Fang-Jing Wu and Gurkan Solmaz. “Crowdestimator: Approximating crowd sizes with multi-modal data for internet-of-things services”, in Proceedings of the 16th Annual International Conference on Mobile Systems, Applications, and Services, pp. 337-349, 2018). However, this solution requires cameras to be statically deployed and always activated.

In short, the problem is to use Wi-Fi measurements for long-term and accurate crowd estimation in smart cities. Wi-Fi measurements can be collected at low cost, and Wi-Fi scanning can be integrated into cheaper devices and widely deployed in urban environments. As such, it is an object of the present invention to improve and further develop a method and a system of the initially described type in such a way that a high accuracy of crowd estimation is achieved in a cost-effective and privacy-preserving way.

In accordance with the invention, the aforementioned object is accomplished by a computer-implemented method for crowd estimation in an environment, the method comprising: receiving, from a mobile intelligent agent having accurate sensing capabilities, environmental data and processing the received environmental data for vehicle, people, and/or object recognition; using wireless sensors deployed in the environment to collect wireless signals from the environment; training, during a training phase, a machine learning model to learn correlations between the collected environmental data and the collected wireless signals; and performing, during an operational phase, crowd estimation based on collected wireless signals only and by calibrating the collected wireless signals using the trained machine learning model.

The aforementioned object is further accomplished by an apparatus for crowd estimation in an environment, the apparatus comprising one or more processors which, alone or in combination, are configured to provide for execution of receiving, from a mobile intelligent agent having accurate sensing capabilities, environmental data and processing the received environmental data for vehicle, people, and/or object recognition, training, during a training phase, a machine learning model to learn correlations between the collected environmental data and wireless signals collected from the environment via wireless sensors deployed in the environment, and performing, during an operational phase, crowd estimation based on collected wireless signals only and by calibrating the collected wireless signals using the trained machine learning model.

According to embodiments, the present invention provides a crowd estimation method/apparatus based on Wi-Fi that addresses the problems of accuracy and adaptability. The method/apparatus applies machine learning for calibrating Wi-Fi scanners, wherein, during the training period, the method/apparatus may learn correlations between noisy Wi-Fi probe packets with Received Signal Strength Indicator (RSSI) and people counting cameras based on temporal features such as day of week and hour of the day. During an operational crowd estimation period, the method/apparatus may leverage only Wi-Fi measurements and calibrate them using the trained machine learning model. The crowd estimation period may span much longer, such as half a year, compared to the shorter training period spanning, e.g. days or weeks. The application of machine learning enables the possibility of removing advanced people counting sensors for the crowd estimation to minimize privacy invasion, deployment costs, and energy consumption.

The invention provides for long-term and accurate crowd estimation in an environment. According to embodiments, the invention provides a system comprising two collaborative components: 1) An intelligent mobile calibration agent, and 2) a digital twin of the environment (e.g. smart city). The mobile calibration agent may trigger new calibration tasks by its movements and the digital twin may simulate the wireless networking characteristics of the environment. The combination of the digital twin and the intelligent agent in accordance with embodiments of the present invention allows generating synthetic data for simulation, which leads to more accurate crowd estimations. The main technical advantages of the system include higher accuracy crowd estimation, energy consumption reduction, and enhanced privacy preservation (by not relying on tracking individuals and removing the need for using cameras constantly).

In an embodiment, through the collaborative usage of a mobile calibration agent and a digital twin for crowd estimation, the higher crowd estimation accuracy may result from the following main reasons:

1) Additional data may be created by Digital Twin: Data augmentation may be achieved using synthetic network simulation data in the digital twin.

2) Intelligent agent’s inputs to Digital Twin: The intelligent mobile agent may feed the digital twin with environmental information. This information may be leveraged by the Digital Twin simulation for realistic network simulations.

3) Efficient planning of calibration agent: The mobile calibration agent may take into account crowd estimation machine learning outputs from the digital twin, and may optimize its movements to enable higher accuracy in a shorter time.

This would also in turn feed the Digital Twin for higher accuracy simulations.

Thus, according to aspects of the invention, the method/system includes the collaborative usage of the network simulation and the mobile agent movement optimization for crowd estimation. Smart cities or larger scale cities or regions are considered as possible application domains of the crowd estimation. For smart buildings, the proposed invention may provide occupancy estimation. Lastly, the crowd estimation can be applied to large indoor areas such as train stations or airports.

According to an embodiment, the method may further comprise generating, by a digital twin platform based on the results of the environmental data processing together with the collected wireless signals, a simulation of the wireless networking characteristics of the environment.

According to an embodiment, the method may further comprise dividing the environment into a number of calibration areas. Then, an ML-based crowd estimator may perform, based at least on the simulation of the wireless networking characteristics of the environment, a crowd estimation for each calibration area and determine for each crowd estimation a value indicative of a certainty of the respective crowd estimation and a value indicative of a temporality of the calibration used for the respective crowd estimation.

According to an embodiment, the mobile intelligent agent may use the results of the crowd estimations performed by the ML-based crowd estimator to adapt the collecting of environmental data.

According to an embodiment, the network simulator of the digital twin platform may be configured to perform the steps of simulating the environment assuming each person would have mobile device(s); collecting simulated wireless measurements of a respective calibration area; matching the wireless signals collected from the wireless sensors with the simulated wireless measurements; adjusting the simulation of the environment based on the matched wireless measurements; estimating the differences between real measurements and adjusted simulated measurements; and outputting the simulated wireless measurements and the estimated differences as inputs for the ML-based crowd estimator.

According to an embodiment, the ML-based crowd estimator of the digital twin platform, for performing crowd estimations, may take into account estimations of crowdedness from a semantic scene simulator of the digital twin platform that simulates the environment using semantic information regarding events taking place within the environment.

According to an embodiment, the mobile intelligent agent may include a calibration planner that uses the crowd estimations of the digital twin platform’s ML-based crowd estimator to devise adapted movements of the mobile intelligent agent within the environment in order to collect environmental data from within specific calibration areas.

According to an embodiment, the calibration planner may be realized, e.g., by means of a Q-learning-based implementation.

According to an embodiment, the calibration planner, for devising adapted movements of the mobile intelligent agent within the environment, may take into account travel times between calibration areas and cross-correlation factors between calibration areas.
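A Q-learning-based planner of this kind can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the embodiment: the state is simply the calibration area the agent is currently in, the action is the next area to visit, and the reward shaping (uncertainty covered at the target, weighted by hypothetical cross-correlation factors, minus a travel-time penalty) is invented for the example. The names `train_planner`, `time_cost`, and all numeric values are hypothetical.

```python
import random

def train_planner(travel, uncertainty, corr, episodes=500, steps=10,
                  alpha=0.1, gamma=0.9, epsilon=0.2, time_cost=0.1, seed=0):
    """Tabular Q-learning over calibration areas.

    State: area the agent is in. Action: a different area to calibrate next.
    Reward (assumed shaping, not taken from the source): uncertainty covered
    at the target, including correlated areas, minus weighted travel time.
    """
    n = len(uncertainty)
    rng = random.Random(seed)
    q = [[0.0] * n for _ in range(n)]

    def reward(s, a):
        covered = sum(corr[a][i] * uncertainty[i] for i in range(n))
        return covered - time_cost * travel[s][a]

    for _ in range(episodes):
        s = rng.randrange(n)
        for _ in range(steps):
            actions = [j for j in range(n) if j != s]
            if rng.random() < epsilon:
                a = rng.choice(actions)                      # explore
            else:
                a = max(actions, key=lambda j: q[s][j])      # exploit
            # Best follow-up move from the area we arrive in.
            next_best = max(q[a][j] for j in range(n) if j != a)
            q[s][a] += alpha * (reward(s, a) + gamma * next_best - q[s][a])
            s = a
    return q

def next_area(q, s):
    """Greedy choice of the next calibration area from area s."""
    return max((j for j in range(len(q)) if j != s), key=lambda j: q[s][j])

# Three areas; area 1 currently has the least certain crowd estimate.
travel = [[0, 1, 2], [1, 0, 1], [2, 1, 0]]           # travel times (made up)
uncertainty = [0.1, 0.9, 0.2]                        # from the ML estimator
corr = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]             # no cross-correlation
q_table = train_planner(travel, uncertainty, corr)
```

With these toy inputs the greedy plan from areas 0 and 2 heads to area 1. A fuller state would also track how recently each area was calibrated, which this sketch leaves out.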

According to an embodiment, the method may further comprise using assistant sensors deployed in the environment including at least one of humidity, noise, and CO₂ sensors to collect auxiliary sensing information from the environment. The digital twin platform may then use the auxiliary sensing information together with the collected wireless signals for the simulation of the wireless networking characteristics of the environment.

There are several ways how to design and further develop the teaching of the present invention in an advantageous way. To this end, it is to be referred to the dependent claims on the one hand and to the following explanation of preferred embodiments of the invention by way of example, illustrated by the figure on the other hand. In connection with the explanation of the preferred embodiments of the invention by the aid of the figure, generally preferred embodiments and further developments of the teaching will be explained. In the drawing

Fig. 1 is a schematic view illustrating the basic concept of long-term and accurate crowd estimation by short-term training of crowd estimation with Wi-Fi and people counting cameras in accordance with an embodiment of the present invention,

Fig. 2 is a diagram illustrating a machine learning method to train crowd estimation in accordance with an embodiment of the present invention,

Fig. 3 is a schematic view illustrating the main building blocks of a system for crowd estimation in accordance with an embodiment of the present invention,

Fig. 4 is a diagram illustrating a graph representation of the calibration problem of an intelligent calibration agent in accordance with an embodiment of the present invention, and

Fig. 5 is a schematic view illustrating an intelligent agent with accurate sensing moving in a smart city scenario in accordance with an embodiment of the present invention.

In recent years, the shift from rural to urban areas, namely urbanization, has caused most cities around the world to experience rapid growth in the number of people they accommodate. As an adaptation to this phenomenon, investments in IoT technologies seek the development of smart and sustainable cities. Examples of IoT solutions include data-driven smart city planning, innovative transport systems for efficient and comfortable commutes, and autonomous cars that adapt to the surrounding environment. The mentioned city services address the citizens’ needs and, therefore, understanding where and how people gather in urban areas is of utmost importance for effective city services.

There are several approaches for crowd estimation in urban areas. One approach is to conduct surveys at busy subway platforms and ask passengers the density of people they expect during their journey and then manually verify the survey results. Large-scale deployment of such manual counting and verification is labor-intensive. Furthermore, considering the dynamic changes in the urban environments, the provided measurements may not be valid after a while.

There exist computer vision-based solutions (for reference see, e.g., Z. Ma, X. Wei, X. Hong, and Y. Gong, “Bayesian loss for crowd count estimation with point supervision”, in IEEE/CVF ICCV, 2019, pp. 6142-6151) as interesting alternatives for automating crowd size estimation. However, camera-based solutions are privacy-invasive and energy-hungry.

There exist studies that explore non-image-based localization solutions. These solutions usually require people’s active participation in the process, either by using a smartphone application or by voluntarily carrying specialized devices with built-in sensors such as RFID tags and Bluetooth. Several studies propose crowd estimation using Wi-Fi scanners with Received Signal Strength Indicator (RSSI) and Channel State Information (CSI). However, existing solutions suffer from low accuracy due to a lack of adaptation to the dynamic changes in cities.

The present invention provides crowd estimation methods and systems based on Wi-Fi that address the problems of accuracy and adaptability. In an embodiment, the method applies machine learning for calibrating Wi-Fi scanners 102 as schematically illustrated in Fig. 1. During a training period 100, the system learns the correlations between noisy Wi-Fi probe packets with Received Signal Strength Indicator (RSSI) and people counting cameras 104 (as disclosed, e.g., in HELLA Aglaia People Sensing Technologies, “Advanced People Sensor APS-180E,” http://people-sensing.com/, 2017) based on temporal features such as day of week and hour of the day. During an operational crowd estimation period 110, the method leverages only Wi-Fi measurements 102 and calibrates them using the machine learning model 120 trained during the training period 100. In other words, for the operational period 110, machine learning 120 may replace the ground-truth cameras 104 thanks to the initial training period 100. The crowd estimation period 110 may span much longer, such as half a year, compared to the shorter training period 100 spanning, e.g., two weeks. The application of machine learning enables the possibility of removing advanced people counting sensors 104 for the crowd estimation, thereby minimizing privacy invasion, deployment costs, and energy consumption.

Fig. 2 is a diagram illustrating an ML method to train crowd estimation in accordance with an embodiment of the present invention. In brief, the method includes preprocessing (i.e., database processing and filtering techniques), feature selection, and final training of an ML model such as polynomial regression or neural networks. More specifically, Fig. 2 illustrates a functional view of: 1) a training period to determine the relationship between Wi-Fi probe packets (PPs) and the near-ground-truth people counting data; and 2) use of a calibration model to predict using the trained machine learning model.

The training period involves data collection for raw Wi-Fi and camera data in parallel. As the raw Wi-Fi data is noisy in its nature and off-the-shelf devices do not accurately count people using Wi-Fi, the method may include the four tasks listed on the left side of Fig. 2 before training or using the calibration model. The off-the-shelf people counting camera provides a calibration interface, and it has built-in image processing and people counting modules.

With regard to database processing, two CouchDB databases may be provided that store raw Wi-Fi PP details and camera data respectively. Wi-Fi PPs include RSSI measurements whereas the camera data includes people count-in and count-out events for different time intervals. The separation of databases is necessary during collection of data because the frequency with which the respective sensors transmit data can range from a few seconds to over a minute. With limited queue size in the cache, it is necessary that the server is able to forward the data to the database quickly and maintain consistency. Since different devices send out PPs at different intervals, it is not feasible to synchronize all device data on the fly. Therefore, once the raw sensor data coming from Wi-Fi and cameras are stored, a process may run over the raw databases and generate a new database with a structure suitable for time-series calculations in the later stages of the architecture.

While performing crowd estimation analysis, it is important not to count the same device multiple times, as this can lead to erroneous outputs and outright incorrect predictions. Therefore, once the two raw databases store all the raw data for a chosen time interval (e.g. for 1 hour), Wi-Fi PP data and people counts from the camera are aggregated, e.g. for every minute. Accordingly, at the end of this stage, each generated document refers to a 1-min time period. One document records the details of all unique devices detected by the Wi-Fi scanners and the people counted by the stereoscopic cameras.
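The per-minute aggregation can be sketched in a few lines. This is only an illustration under assumptions: the tuple layouts for probe packets and camera events and the function name `aggregate_per_minute` are hypothetical, and plain in-memory records stand in for the CouchDB documents.

```python
from collections import defaultdict

def aggregate_per_minute(probe_packets, camera_events):
    """Build one document per 1-minute period.

    probe_packets: iterable of (unix_seconds, hashed_device_id, rssi) tuples.
    camera_events: iterable of (unix_seconds, count_in, count_out) tuples.
    Both record layouts are assumptions made for this sketch.
    """
    docs = defaultdict(lambda: {"devices": set(), "count_in": 0, "count_out": 0})
    for ts, device_id, _rssi in probe_packets:
        # A set keeps each device once per minute, however often it probed.
        docs[ts // 60 * 60]["devices"].add(device_id)
    for ts, count_in, count_out in camera_events:
        doc = docs[ts // 60 * 60]
        doc["count_in"] += count_in
        doc["count_out"] += count_out
    # Return the documents ordered by the start second of their minute.
    return dict(sorted(docs.items()))

packets = [(0, "a", -60), (10, "a", -62), (30, "b", -80), (65, "a", -61)]
events = [(5, 2, 0), (50, 1, 1), (70, 0, 2)]
documents = aggregate_per_minute(packets, events)
```

Device "a" probes twice in the first minute but appears once in that minute's document, which is the de-duplication the paragraph above calls for.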

With regard to RSSI and bucket filtering, it is important to note that RSSI values range from -100 dB to 0 dB according to the IEEE 802.11 system standard. The closer the value is to 0, the stronger the signal of the received Wi-Fi packet. When the signal is strong, the device is highly likely to be closer to the scanner than a device that sent a Wi-Fi packet with a low RSSI value. However, differences in RSSI might be due to different Wi-Fi hardware. To rely on the relationship between RSSI strength and device distance, records with too low RSSI may be filtered to reduce noise. In a test implementation, the RSSI threshold was set empirically to -87 dB, below which received PPs are ignored. This is to minimize the noise created by devices considered too far away from the scanners.
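The RSSI filter amounts to a simple threshold test, sketched below. The record layout and the function name `filter_weak_packets` are hypothetical, and treating a packet exactly at the threshold as kept is an assumption, since the text does not say whether the bound is inclusive.

```python
RSSI_THRESHOLD_DB = -87  # empirical value reported for the test implementation

def filter_weak_packets(probe_packets, threshold=RSSI_THRESHOLD_DB):
    """Keep packets whose RSSI is at or above the threshold.

    probe_packets: iterable of (hashed_device_id, rssi) pairs (assumed layout).
    """
    return [(dev, rssi) for dev, rssi in probe_packets if rssi >= threshold]

packets = [("a", -45), ("b", -87), ("c", -93)]
kept = filter_weak_packets(packets)
```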

In the bucket filtering phase, the aim is to remove static devices from the environment. Static devices are devices such as printers, scanners, and other electronic devices which may broadcast Wi-Fi PPs continuously for as long as they have Wi-Fi activated, but their PPs should not be included when predicting the people count. Therefore, to get a better estimate of crowd size from the wireless data, it is necessary to remove PPs coming from these static devices. To differentiate these appliances from devices owned by individual people, one first needs to understand the relationship between the number of observations coming from a device and time. The interval between probe packets emitted by a device can range from a few seconds to a minute. Taking 4 seconds as the shortest interval, the maximum frequency for a 1-minute period is therefore 60/4 = 15 and the minimum frequency is 1. Knowing this, buckets of observations were made, where each bucket carries all PPs whose unique hashed identity was observed at most as many times as the upper limit of the bucket. For instance, if the bucket represents a range of [0, 50], all PPs whose hashed ID was detected a maximum of 50 times in the training period will be included in the bucket. All other packets will be ignored as belonging to static devices. The algorithm then chooses the bucket which maximizes the training accuracy, and considers all devices with more observations than the bucket maximum as static devices.
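The bucket selection can be illustrated with a small sketch. It makes several assumptions: packets are reduced to hypothetical `(minute, device_id)` pairs, the candidate upper limits are supplied by the caller, and the "training accuracy" criterion is approximated by the RMSE between per-minute unique device counts and camera counts, without fitting a calibration model.

```python
from collections import Counter
from math import sqrt

def per_minute_counts(packets, allowed):
    """Unique allowed devices per minute; packets are (minute, device_id)."""
    seen = {}
    for minute, dev in packets:
        if dev in allowed:
            seen.setdefault(minute, set()).add(dev)
    return seen

def choose_bucket(packets, camera_counts, upper_limits):
    """Return (best_upper_limit, static_devices).

    camera_counts maps minute -> near-ground-truth people count.
    Scoring by raw RMSE against the camera is a stand-in for the
    'training accuracy' criterion, whose exact form is an assumption here.
    """
    observations = Counter(dev for _minute, dev in packets)
    best = None
    for upper in upper_limits:
        # Bucket [0, upper]: devices observed at most `upper` times.
        allowed = {d for d, n in observations.items() if n <= upper}
        counts = per_minute_counts(packets, allowed)
        err = sqrt(sum((len(counts.get(m, ())) - y) ** 2
                       for m, y in camera_counts.items()) / len(camera_counts))
        if best is None or err < best[0]:
            best = (err, upper)
    upper = best[1]
    static = {d for d, n in observations.items() if n > upper}
    return upper, static

# "printer" probes in every minute; phones appear only briefly (toy data).
packets = ([(m, "printer") for m in range(6)]
           + [(0, "p1"), (1, "p1"), (2, "p2"), (5, "p3")])
camera = {0: 1, 1: 1, 2: 1, 3: 0, 4: 0, 5: 1}
best_upper, static_devices = choose_bucket(packets, camera, upper_limits=[2, 4, 8])
```

In this toy data the always-present "printer" exceeds the chosen upper limit and is classified as static, while the three phones remain in the bucket.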

By this phase, the data pre-processing steps to format data on a per-minute basis and reduce noise from Wi-Fi data are completed. Next, the process of training feature selection will be described in some more detail. To begin training the actual model, the training phase needs to be modelled as an optimization problem to solve. The model then tries to generate a mapping function between the time-series raw PPs count as input and the near-ground-truth people count observed by the camera as the dependent variable. The equation for the initial optimization problem is as follows:

where t is the time period, y'(t) is calibrated people count, f(x) denotes the calibration function, x(t) represents the raw Wi-Fi PPs count, y(t) represents near-ground-truth people count detected by camera, E is the error value, and N is the number of time periods.
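The equation referred to above does not survive in this text. One form that is consistent with the symbol definitions and with the root mean squared error loss mentioned for the neural network training is sketched here as an assumption, not as the original expression:

```latex
\min_{f}\; E = \sqrt{\frac{1}{N}\sum_{t=1}^{N}\bigl(y'(t) - y(t)\bigr)^{2}},
\qquad y'(t) = f\bigl(x(t)\bigr)
```

Under this reading, the calibration function f is chosen so that the calibrated counts y'(t) deviate as little as possible, in the RMSE sense, from the camera counts y(t) over the N time periods.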

In the collected data, it could be observed that there is a daily pattern of people movement in the area. Hence, using a temporal feature such as “hour of the day” as an input feature would be helpful to detect the pattern of crowd size throughout the day. Similarly, using “day of the week” as another temporal feature would be helpful to accommodate the slight variation weekends and other weekly events may have on the crowd size patterns. The exact timestamp of the data points was not included because trying to observe patterns at the minutes or seconds level results in overfitting the model. Taking these observations into account, the problem can be modified using temporal features:

where h is the hour of day, d is the day of week, and H is the total number of hours over which the calibration takes place.
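Deriving the two temporal features from a document timestamp can be sketched as follows. The function names, the use of UTC, and the Monday = 0 convention are assumptions of this example; a deployment would presumably use local time.

```python
from datetime import datetime, timezone

def temporal_features(unix_seconds):
    """Return (hour_of_day, day_of_week) for a 1-minute document timestamp.

    Minutes and seconds are deliberately dropped, in line with the
    overfitting concern above. UTC and Monday=0 are assumptions of this sketch.
    """
    moment = datetime.fromtimestamp(unix_seconds, tz=timezone.utc)
    return moment.hour, moment.weekday()

def feature_row(unix_seconds, raw_pp_count):
    """Input vector (x, h, d) for the calibration function."""
    h, d = temporal_features(unix_seconds)
    return (raw_pp_count, h, d)

# 2022-08-02 09:45:00 UTC, a Tuesday, with 42 raw probe packets counted.
row = feature_row(1_659_433_500, 42)
```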

In accordance with embodiments of the present invention, two machine learning approaches may be implemented for the crowd estimation optimization problem:

• Polynomial regression models are considered fitting due to the nature of the problem for crowd size estimation.

• Neural network models can be capable of learning more complex patterns from the time-series data.

For both these approaches, the dataset was divided into training, validation and testing sets. Several ratios between training and testing were tried out to ensure the best trade-off between accuracy and training data size.

1) Polynomial regression models: To apply a regression model, one needs to make sure that the training data does not have bias due to trends and seasonality. Further, it should be ensured that the data is stationary, which is important when doing time-series analysis. The choice of the right regression model has a huge impact on how well the model is able to fit the data and generalize. Appropriate models include, e.g., Ridge regression and Lasso regression.
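As an illustration of the regression approach, the following sketch fits a ridge-regularized polynomial to a single input, the raw PP count. It is a simplified example under assumptions: the temporal features are omitted for brevity, the data is made up, the intercept is penalized along with the other weights, and the closed-form normal equations are solved by hand so that the block has no external dependencies. Lasso regression is not shown.

```python
def polynomial_features(x, degree):
    """[1, x, x^2, ..., x^degree] for a scalar input."""
    return [x ** k for k in range(degree + 1)]

def solve(a, b):
    """Solve a @ w = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    w = [0.0] * n
    for i in reversed(range(n)):
        w[i] = (m[i][n] - sum(m[i][j] * w[j] for j in range(i + 1, n))) / m[i][i]
    return w

def fit_ridge(xs, ys, degree=2, lam=1e-3):
    """Minimise ||Xw - y||^2 + lam * ||w||^2 via the normal equations."""
    rows = [polynomial_features(x, degree) for x in xs]
    p = degree + 1
    xtx = [[sum(r[i] * r[j] for r in rows) + (lam if i == j else 0.0)
            for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(p)]
    return solve(xtx, xty)

def predict(w, x):
    """Evaluate the fitted polynomial at x."""
    return sum(wk * x ** k for k, wk in enumerate(w))

# Toy data: people count grows sub-linearly with raw PP count (made-up numbers).
raw_pp = [0, 5, 10, 15, 20, 25, 30]
people = [0.5 * x - 0.005 * x * x for x in raw_pp]
weights = fit_ridge(raw_pp, people, degree=2, lam=1e-6)
calibrated = predict(weights, 12)
```

The regularization strength `lam` and the polynomial degree are the knobs that trade fit against generalization.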

2) Neural network models: Neural networks, in principle, are designed to model non-linear associations automatically. These non-linearities usually require explicit transformations for a regression model to accommodate such relationships. Thus, a neural network approach suits the problem of crowd counting, which depends on a variety of factors that might contribute to these non-linear relationships. But it is fairly common for a neural network model to overfit, especially if the training data is small and the model architecture keeps growing or the number of epochs is increased beyond an optimal number.

Therefore, a test case was started with a small architecture, and layers and nodes were added after empirical analysis until out-of-sample prediction performance started to decline. The ReLU activation was used in order to ensure non-saturation of gradients, which greatly speeds up the convergence of stochastic gradient descent (SGD). Further, empirical observations were used to select the learning rate in order not to end up with sub-optimal models. A custom root mean squared error (RMSE) loss function as described in the above equation was employed to compute the gradients. For the optimization function, different models were trained with the Adam optimizer as well as SGD with Nesterov momentum.
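The ingredients named above (ReLU, an RMSE measure, and SGD with Nesterov momentum) can be illustrated on a deliberately tiny model with one weight. This is a sketch under assumptions, not the network of the test case: the model y' = relu(w * x), the hyperparameters, and the data are invented, and the per-sample gradient is taken of the squared error rather than of the RMSE itself, which shares the same minimizer on this noise-free data.

```python
from math import sqrt

def relu(z):
    """ReLU activation: passes positive inputs, zeroes the rest."""
    return z if z > 0.0 else 0.0

def rmse(predicted, observed):
    """Root mean squared error between calibrated and camera counts."""
    n = len(observed)
    return sqrt(sum((p - y) ** 2 for p, y in zip(predicted, observed)) / n)

def train_nesterov(xs, ys, w=0.1, lr=0.01, momentum=0.5, epochs=50):
    """Fit y' = relu(w * x) by per-sample SGD with Nesterov momentum."""
    velocity = 0.0
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            # Nesterov: evaluate the gradient at the look-ahead position.
            lookahead = w + momentum * velocity
            active = lookahead * x > 0.0
            grad = 2.0 * (relu(lookahead * x) - y) * x if active else 0.0
            velocity = momentum * velocity - lr * grad
            w += velocity
    return w

xs = [1.0, 2.0, 3.0, 4.0]      # raw PP counts (toy values)
ys = [2.0, 4.0, 6.0, 8.0]      # camera counts (toy values)
w_fit = train_nesterov(xs, ys)
final_error = rmse([relu(w_fit * x) for x in xs], ys)
```

On this toy data the weight converges towards 2, so the final RMSE is close to zero.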

In an embodiment, the prerequisites of a system for crowd estimation include sensor devices deployed to and distributed in the environment and at least one mobile agent with more accurate sensing capabilities. The agent’s movement decisions may trigger automatic calibration data collection and calibration for crowd estimation through learning correlations between different sensor modalities of the mobile agent’s sensors and the statically deployed sensor devices. It is assumed that the mobile agent has more reliable sensing devices (e.g., 360° cameras, Lidar, etc.), whereas the sensors statically deployed in the environment (e.g., a city) are less reliable (e.g., Wi-Fi or Bluetooth antennas). On the other hand, the static sensors have substantially higher availability over long time periods as well as better energy efficiency.

Fig. 3 schematically illustrates the main building blocks of a system 200 for crowd estimation in accordance with embodiments of the present invention. As shown, the physical system components include an intelligent agent 202 performing flexible calibration planning, large-scale sensors 204 (e.g., Wi-Fi or Bluetooth antennas) and a digital twin 206. Both the intelligent agent 202 and the digital twin 206 support the wireless sensing of crowd estimation. According to the embodiment illustrated in Fig. 3, the intelligent agent 202 comprises a training agent, denoted Intelligent Crowd Estimation Training Agent 230 herein. The training agent 230 is a mobile entity, for instance, a robot that has automated movement capabilities such as an autonomous vehicle. The agent 230 may have an internal operating system for computation and it may have communication capabilities using wireless communication. As shown under the bracket in Fig. 3, the agent 230 may comprise multiple modules that run on the operating system. An important component among these modules is a “Flexible Calibration Planner” module 240, which is described in more detail below.

According to embodiments of the invention, the training agent 230 may include one or more of the following additional components:

Vehicle Recognition module 232. This module may be configured to leverage the sensors of the intelligent agent 202 (not explicitly shown in Fig. 3), such as cameras or Lidar, in order to recognize the existence and locations of any vehicles. For instance, this module can use a video processing engine, e.g. the Yolo real-time object detection engine (as described, e.g., at https://pjreddie.com/darknet/yolov2/). First, a video processing engine may filter the vehicles from all detected objects and then map the virtual location on the image (e.g., bounding box defined by Yolo) into real coordinates. As an outcome, the vehicle recognition module 232 may provide information on the existence and locations of vehicles.

People Recognition module 234. This module may be configured to operate in a similar way as the vehicle recognition module 232 as it may also leverage the sensors on the intelligent agent 202. On the other hand, it may leverage participatory information shared by vulnerable road users (VRUs) such as pedestrians or bicycle riders. As an outcome, the people recognition module 234 may provide information on the existence and locations of people.

Object Recognition module 236. This module is responsible for detecting any other kind of objects that are relevant to the crowd estimation problem. The classification of objects is one of the key problems for object recognition, as the candidate classes would significantly affect the performance. For instance, the same object can be classified as “object” or “traffic sign”, or more particularly “traffic light”. On the other hand, in the context of the present invention, object recognition may be implemented to mainly consider those objects that might have any impact on the crowdedness of the given environment, such as “buildings” or “bus stops”.

Vehicle Route Planner module 238. This module may be configured to take the outputs of Flexible Calibration Planner 240 and, based on the given plan, to decide on the next route. The route may consist of a finite set of motions. More particularly, motion planning can be considered. In general, the module 238 can be considered as the “route” planner, which plans the movement routes of the intelligent agent 202.

Image Processing module 242. This module may be configured to perform the computational processing of images and to create a set of image or video features. The module may be responsible for any lower-layer processing of images that might be useful for the upper-layer recognition engines 232, 234 and 236. Signal processing techniques are considered in the image processing module 242.

As shown in Fig. 3, the system 200 may further include various large-scale sensors 204 deployed in the environment such as a smart city. According to embodiments of the invention, these sensors 204 are, as compared to the sensors of the intelligent agent 202, relatively cheaper and easy to deploy without much configuration effort. On the other hand, they lack accuracy in terms of providing crowd estimation.

The sensors 204 may include multiple sensors (including different sensor technologies) deployed widely in a large area of interest. In an embodiment, the sensors 204 may include wireless sensors 252 and assistant sensors 254.

Wireless Sensors 252 may include sensors that comprise a wireless antenna and an internal processing module. The wireless antenna collects wireless signals from the environment. The range and precision of the measurements depend on various factors (e.g., size and model of the antenna) and techniques such as channel hopping and beamforming. The internal processing module is responsible for decrypting the signals and creating data packets that can later be consumed by the digital twin 206. The wireless measurements can use different protocols such as Wi-Fi and Bluetooth. The wireless sensors 252 can capture Wi-Fi probes or Bluetooth advertisement packets that are broadcast from mobile devices in the environment. The range of the measurements depends on the receiver antenna, whereas in general it is supposed to be up to 100 m. Wireless signals are noisy and sparse in their nature, thus relying only on the sensing from wireless signals will mostly not lead to high-accuracy crowd estimations.

Assistant Sensors 254 include sensors that are considered to provide “auxiliary” sensing information from the environment, such as CO₂ levels or noise in the environment. There are various environmental sensors that are already deployed in smart cities, such as humidity, noise, and CO₂ sensors. These sensors 254 can be located together with the wireless sensors 252 or separately at a nearby location. As will be appreciated by those skilled in the art, larger spatial distances between the assistant sensors 254 and the wireless sensors 252 could possibly create problems in terms of the reliability of the data of the assistant sensors 254. Similar to the wireless sensors 252, the assistant sensors 254 may not be designed to be capable of providing high-accuracy crowd estimation, although they may be useful for providing environmental hints.

As shown in Fig. 3, the system 200 may further include a digital twin module 206 that simulates the wireless networking characteristics of the respective environment. In an embodiment, the digital twin module 206 may comprise a digital twin platform 250, which is a server-side system where the digital twin resides. This platform 250 may comprise dedicated hardware, including one or more processors for the computation and communication requirements of the digital twin, and software modules on top of the hardware. The platform 250 is responsible for data harmonization and AI. Furthermore, the platform 250 may have other capabilities that are illustrated as separate components below the platform 250, within the bracket in Fig. 3. According to the illustrated embodiment, these further capabilities include the following components:

A Wireless Data component 212 may refer to a database that consists of the datasets coming from the wireless sensors 252 as well as from the assistant sensors 254. The component is denoted ‘wireless’ herein for simplicity, as the wireless sensors 252 are the main source of data for crowd estimation.

A Localization Engine 214 is considered as a set of localization techniques using wireless and other datasets, including contextual datasets. For instance, the localization techniques may include some of the techniques developed and implemented as a localization platform in the LOCUS EU project (for reference, see https://www.locus-project.eu/). The localization engine 214 can use wireless data through triangulation, deviceless localization (by adding wireless transceivers to the sensors), or other techniques. Furthermore, the localization engine 214 can use video/image data for localizing people, vehicles, or other objects. As shown in Fig. 3 and as described in more detail below, the outputs of the localization engine 214 may be fed to an ML-based crowd estimator 210 and to a network simulator 208.
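The wireless localization mentioned above can be illustrated with a small Python sketch. Note that it uses distance-based trilateration, a technique closely related to the triangulation named in the description, and it assumes that the distances from three sensors to a device have already been estimated (for instance from signal strength). The function name and the coordinate convention are hypothetical.

```python
import math


def trilaterate(anchors, distances):
    """Estimate a 2-D position from three sensor positions and distances.

    anchors: three (x, y) sensor positions in metres (assumed known).
    distances: three estimated sensor-to-device distances in metres.
    Subtracting the first circle equation from the other two gives a
    2x2 linear system, solved here with Cramer's rule.
    """
    (x1, y1), (x2, y2), (x3, y3) = anchors
    d1, d2, d3 = distances
    a11, a12 = 2 * (x2 - x1), 2 * (y2 - y1)
    a21, a22 = 2 * (x3 - x1), 2 * (y3 - y1)
    b1 = d1**2 - d2**2 + x2**2 - x1**2 + y2**2 - y1**2
    b2 = d1**2 - d3**2 + x3**2 - x1**2 + y3**2 - y1**2
    det = a11 * a22 - a12 * a21
    if abs(det) < 1e-12:
        raise ValueError("sensor positions are collinear; position is ambiguous")
    return (b1 * a22 - a12 * b2) / det, (a11 * b2 - a21 * b1) / det


# Hypothetical layout: three sensors and a device located at (3, 4).
sensors = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
true_pos = (3.0, 4.0)
dists = [math.hypot(true_pos[0] - sx, true_pos[1] - sy) for sx, sy in sensors]
est_x, est_y = trilaterate(sensors, dists)
```

With noise-free distances the estimate coincides with the true position; with real, noisy distance estimates a least-squares fit over more than three sensors would typically be preferred.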

A Semantic Scene Simulator 216 may be configured to simulate the environment using all semantic information and to estimate crowdedness based on that simulation. For instance, multiple events from City Data Services 220 with an expected number of people, as well as real-time logs from transportation services, may be used to simulate a certain environment and estimate its crowdedness. The semantic scene simulator 216 may be considered to provide higher-level probabilistic estimations through simple heuristics, whereas the network simulator 208 may be considered lower-level, as it considers various network layers and the wireless channels for a realistic simulation of the wireless characteristics. Thus, although both are simulators, they may be implemented as separate components in the digital twin 206.

City Data Services 220 can feed the digital twin platform 250 with data records from the smart city. For instance, any events that are planned/happening in the city and their expected locations can be accessed by the digital twin through the city data services 220.

Event Subscriber 218 is a module in the digital twin 206 that subscribes to specific events (e.g., open-air festival, sports event) from the city data services 220. The information received from the events may be sent to the Semantic Scene Simulator 216 for further processing.
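The interplay of the city data services 220, the event subscriber 218, and the semantic scene simulator 216 can be sketched as follows. This is only an illustration under stated assumptions: the event fields, the topic names, the callback interface, and the additive heuristic (baseline plus expected attendance of events active in the area) are all hypothetical and are not prescribed by this description.

```python
from dataclasses import dataclass


@dataclass
class CityEvent:
    name: str
    area_id: int              # calibration area where the event takes place
    start_h: float            # start time, hour of day
    end_h: float              # end time, hour of day
    expected_attendance: int


class EventSubscriber:
    """Keeps only events whose topic was subscribed to (hypothetical interface)."""

    def __init__(self, topics):
        self.topics = set(topics)
        self.received = []

    def on_event(self, topic, event):
        if topic in self.topics:
            self.received.append(event)


def estimate_crowdedness(events, area_id, hour, baseline=0):
    """Simple heuristic: baseline plus attendance of events active in the area."""
    active = (
        e.expected_attendance
        for e in events
        if e.area_id == area_id and e.start_h <= hour < e.end_h
    )
    return baseline + sum(active)


subscriber = EventSubscriber(topics={"festival", "sports"})
subscriber.on_event("festival", CityEvent("open-air festival", 1, 18.0, 23.0, 5000))
subscriber.on_event("sports", CityEvent("football match", 1, 20.0, 22.0, 30000))
subscriber.on_event("roadworks", CityEvent("lane closure", 1, 0.0, 24.0, 0))
estimate_21h = estimate_crowdedness(subscriber.received, area_id=1, hour=21.0, baseline=200)
estimate_19h = estimate_crowdedness(subscriber.received, area_id=1, hour=19.0, baseline=200)
```

In this example the unsubscribed "roadworks" topic is ignored, and the estimate for 21:00 includes both events while the estimate for 19:00 includes only the festival.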

According to an embodiment of the invention, the intelligent agent 202 may be configured to perform flexible calibration planning. To this end, the agent 202 may take crowd estimation outputs from the crowd estimator module 210 of the digital twin 206 as inputs for the certainty. Furthermore, the agent 202 may be given a graph model of the environment, e.g. as shown in Fig. 4.

In this model 300, as exemplarily illustrated in Fig. 4, each node of the graph represents a calibration area 302 where the crowd estimation happens. Each area 302 can be considered based on an uncertainty of the ML-based crowd estimator 210 as well as a temporality of the calibration. These two characteristics, 1) crowd estimation certainty and 2) temporality of the calibration, can be represented as a tuple [x,y], normalized for simplicity such that x,y ∈ [0,1]. Exemplary values are indicated for each of the calibration areas 1-5 in Fig. 4. For instance, in the best case, the value of the tuple is [1,1], meaning the certainty is 100% and the temporality is also 100% (i.e., the calibration area was visited very recently and for a long duration). The temporal behavior can be considered “fulfilled” (having up-to-date temporal information) as long as the calibration happens. Otherwise, the temporal behavior can be considered “not fulfilled” (not having up-to-date information).

Furthermore, as shown in Fig. 4, each of the calibration areas 302 can be cross-correlated to any other of the calibration areas 302. This cross-correlation may be realized according to the principles described in Fang-Jing Wu and Gürkan Solmaz. “Crowdestimator: Approximating crowd sizes with multi-modal data for internet-of-things services.” In Proceedings of the 16th Annual International Conference on Mobile Systems, Applications, and Services, pp. 337-349. 2018, which is hereby incorporated by reference herein. In that case, calibration of one area 302 might also partially fulfill calibration of a respective cross-correlated area 302. In Fig. 4, cross-correlation links are exemplarily shown as edges between two nodes (dashed lines). As also illustrated in Fig. 4 (solid lines), there might be a certain travel time w between the calibration areas 302. A longer travel time w would lead to a loss of calibration performance as well as more energy consumption. Thus, the task of the intelligent calibration planner is to optimize based on the crowd estimation calibration performance and, possibly, its energy consumption. The agent 202 may be configured to flexibly change its decisions based on the changes happening in the dynamic graph 300.
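One possible in-memory representation of the dynamic graph 300 is sketched below in Python. The class and field names are hypothetical. In particular, the rule used for partially fulfilling a cross-correlated area (moving its certainty and temporality towards 1 by a fraction equal to the cross-correlation factor) is an assumption chosen for illustration; the description does not define this rule.

```python
from dataclasses import dataclass, field


@dataclass
class CalibrationArea:
    certainty: float    # x in [0, 1]
    temporality: float  # y in [0, 1]


@dataclass
class CalibrationGraph:
    areas: dict = field(default_factory=dict)        # area id -> CalibrationArea
    travel_time: dict = field(default_factory=dict)  # (i, j) -> travel time w
    cross_corr: dict = field(default_factory=dict)   # (i, j) -> factor in [0, 1]

    def calibrate(self, i):
        """Mark area i as freshly calibrated and partially update correlated areas."""
        self.areas[i].certainty = 1.0
        self.areas[i].temporality = 1.0
        for (a, b), c in self.cross_corr.items():
            if i not in (a, b) or a == b:
                continue
            other = self.areas[b if a == i else a]
            # Assumed rule: close a fraction c of the remaining gap to 1.
            other.certainty += c * (1.0 - other.certainty)
            other.temporality += c * (1.0 - other.temporality)


graph = CalibrationGraph(
    areas={
        1: CalibrationArea(0.25, 0.5),
        2: CalibrationArea(0.5, 0.0),
        3: CalibrationArea(0.75, 0.25),
    },
    travel_time={(1, 2): 4.0, (2, 1): 4.0, (2, 3): 7.0, (3, 2): 7.0},
    cross_corr={(1, 2): 0.5},
)
graph.calibrate(1)
```

After calibrating area 1 in this example, area 1 holds [1, 1], the cross-correlated area 2 moves from [0.5, 0] to [0.75, 0.5], and area 3 is unchanged.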

According to an embodiment of the invention, the intelligent agent’s 202 calibration planner 240 may be realized by means of a Q-learning-based implementation. For executing its calibration task, the intelligent agent 202 may perform adapted movements for which various optimization algorithms can be considered. A possible optimization algorithm will be described hereinafter based on pseudocode using the following notation:

- N = {A^{(1)}, A^{(2)}, ..., A^{(|N|)}}: Set of nodes (i.e. calibration areas 302)

- E = {W^{(i, j)}, W^{(j, i)}, ...}: Travel times between calibration areas 302 (dynamic and bidirectional)

- C = {C^{(i, j)}, ...}: Cross-correlation factors between calibration areas 302 (dynamic)

- A^{(i)} = [x^{(i)}, y^{(i)}]: Area i, with crowd estimation certainty x and calibration temporality y (dynamic x, y)

- Q(A^{(i)}, m^{(i, j)}): q-values considering actions m^{(i, j)} from the state A^{(i)}

- Q = {q^{(i, j)}, ...}: Set of q-values

The algorithm may receive the following parameters as input:

- N ← Set of nodes (i.e. calibration areas 302)

- A^{(i)} = [x^{(i)}, y^{(i)}] ← Current area of the agent 202 and its status in terms of certainty and temporality

- E ← Initial travel times between nodes (main edges)

- C ← Initial cross-correlation factors between nodes (secondary edges)

- Q ← Zeros(|N|, |N|)

- ε ← Exploration threshold

- α ← Learning rate

- γ ← Discount factor

- δ_E, δ_C, δ_x, δ_y ← Weights for rewarding based on travel times, cross-correlation, certainty, and temporality

- λ_x, λ_y ← Movement decision threshold based on certainty and temporality

The algorithm may yield the following output:

- m^{(i, j)} ← movement action by the intelligent agent 202

According to an embodiment, the optimization algorithm may be implemented based on the following pseudocode:

r = UniformRandom(0,1) # initialize a random variable

If (x^{(i)} < λ_x, y^{(i)} < λ_y):

- return m^{(i, i)} # Decision to stay in the current area A^{(i)}

Else If (r < ε):

- W^{(i, j)} = Random({∀W^{(i, k)} | W^{(i, k)} ∈ E}) # Select a random edge from A^{(i)}

- return m^{(i, j)}

Else:

- For each possible movement m^{(i, j)} from A^{(i)} s.t. W^{(i, j)} ∈ E: # Calculate reward

- [reward formula not recoverable from the source text]

- Take movement decision m^{(i, j)} from A^{(i)}:

- [movement decision formula not recoverable from the source text]

—End of pseudocode—

The above movement decision algorithm may be repeated for every movement decision (movement step). The “reward” calculation can take into account the certainty and temporality ([x,y]) values as well as the travel time and cross-correlation between areas 302. The certainty and temporality can be provided by the digital twin’s 206 ML-based crowd estimator 210. Lastly, a decision option can be to stay in the same state (from A^{(i)} to A^{(i)}, given by movement m^{(i, i)}), based on certain thresholds to be satisfied.
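A runnable Python sketch of one movement decision step is given below. It follows the control flow of the pseudocode above, but several parts are assumptions, because the corresponding formulas are not available in the text: the linear form and the signs of the reward, the use of the standard Q-learning update rule, the update of all outgoing edges before choosing the best one, and the reading of the stay condition as "certainty or temporality below its threshold". All function and variable names are hypothetical.

```python
import random


def reward(i, j, areas, E, C, d_E, d_C, d_x, d_y):
    """Assumed linear reward: penalize travel time, favor correlated areas and
    areas with low certainty or outdated calibration."""
    x_j, y_j = areas[j]
    return (-d_E * E[(i, j)] + d_C * C.get((i, j), 0.0)
            + d_x * (1.0 - x_j) + d_y * (1.0 - y_j))


def decide_movement(i, areas, E, C, Q, eps, alpha, gamma, weights,
                    lam_x, lam_y, rng=random):
    """Return the next area id for an agent currently in area i."""
    r = rng.random()                       # random variable for exploration
    x_i, y_i = areas[i]
    if x_i < lam_x or y_i < lam_y:         # assumed reading of the stay condition
        return i                           # movement m(i, i): keep calibrating
    neighbors = [k for (a, k) in E if a == i]
    if not neighbors:
        return i
    if r < eps:                            # explore: random outgoing edge
        return rng.choice(neighbors)
    for j in neighbors:                    # exploit: update q-values, pick best
        target = reward(i, j, areas, E, C, *weights) + gamma * max(Q[j].values())
        Q[i][j] += alpha * (target - Q[i][j])   # standard Q-learning update
    return max(neighbors, key=lambda j: Q[i][j])


# Hypothetical three-area example: area -> (certainty x, temporality y).
areas = {1: (0.9, 0.9), 2: (0.2, 0.1), 3: (0.8, 0.7)}
E = {(1, 2): 5.0, (2, 1): 5.0, (1, 3): 2.0, (3, 1): 2.0}
C = {(1, 3): 0.5}
Q = {a: {b: 0.0 for b in areas} for a in areas}
params = dict(eps=0.0, alpha=0.5, gamma=0.9, weights=(0.1, 0.5, 1.0, 1.0),
              lam_x=0.5, lam_y=0.5, rng=random.Random(0))
next_from_1 = decide_movement(1, areas, E, C, Q, **params)
next_from_2 = decide_movement(2, areas, E, C, Q, **params)
```

With exploration disabled (ε = 0) in this example, the agent leaves the well-calibrated area 1 for the poorly calibrated area 2 despite the longer travel time, and once in area 2 it stays there because the certainty of that area is still below the threshold.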

Next, network simulation as performed by the digital twin component 206 will be described in more detail. The “urban” digital twin 206 may be configured to collect information from the real environment (e.g., from calibration areas 302). According to an embodiment of the invention, information may be collected from a minimum of two sources, namely from 1) sensing devices of the mobile agent 202, and from 2) statically deployed sensors 204. From (2), the localization of the people in the environment can be performed using various techniques such as triangulation. Similarly, from (1), people can be located using existing techniques. For instance, for the image data collection, the virtual location of a person in the image, the camera angle, and the GPS coordinates of the vehicle can be used to localize the person. The localization results may be fed to the network simulator 208 of the digital twin component 206. Furthermore, the network simulator 208 may take environmental inputs from the agent 202, such as existing buildings, obstacles, and so on.

Through these inputs, the network simulator 208 may be enabled to perform the following steps. For each calibration area 302, the network simulator 208 may

1) Simulate the environment assuming each person would have mobile device(s);

2) Collect simulated wireless measurements of the given calibration area 302;

3) Match the real wireless measurements collected from the static sensors 204 with the simulated measurements;

4) Adjust the simulation based on the matched wireless measurements;

5) Estimate the differences between real measurements and adjusted simulated measurements; and

6) Output the simulated wireless measurements and the estimated differences as inputs for ML-based crowd estimator 210.
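The six steps above can be sketched in simplified form as follows. The sketch assumes a log-distance path loss model for the simulated signal strength, rank-based matching of real and simulated values, and a single additive offset as the adjustment; none of these choices is prescribed by the description, and the parameter values and function names are hypothetical.

```python
import math


def simulate_rssi(device_positions, sensor_position, p0_dbm=-40.0, path_loss_exp=3.0):
    """Steps 1-2: one simulated device per localized person; RSSI from a
    log-distance path loss model with a 1 m reference distance (assumed)."""
    sx, sy = sensor_position
    rssi = []
    for x, y in device_positions:
        d = max(math.hypot(x - sx, y - sy), 1.0)
        rssi.append(p0_dbm - 10.0 * path_loss_exp * math.log10(d))
    return rssi


def match_and_adjust(real_rssi, simulated_rssi):
    """Steps 3-5: match by rank, shift the simulation by the mean offset,
    and return the adjusted simulation plus the remaining differences."""
    pairs = list(zip(sorted(real_rssi), sorted(simulated_rssi)))
    if not pairs:
        return list(simulated_rssi), []
    offset = sum(r - s for r, s in pairs) / len(pairs)
    adjusted = [s + offset for s in simulated_rssi]
    residuals = [r - (s + offset) for r, s in pairs]
    return adjusted, residuals


# Hypothetical calibration area: two localized people, one static sensor.
people = [(10.0, 0.0), (0.0, 100.0)]
simulated = simulate_rssi(people, sensor_position=(0.0, 0.0))
real = [-75.0, -105.0]  # measurements from the static sensor (made-up values)
adjusted, residuals = match_and_adjust(real, simulated)
# Step 6: (adjusted, residuals) would be passed on to the crowd estimator.
```

In this example the simulation overestimates the signal strength by a constant 5 dB, so the adjustment removes the offset and the remaining differences are zero.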

According to embodiments of the invention, the ML-based crowd estimator 210 may be trained for each calibration area 302 separately and it may take into account the following inputs:

1) Wireless measurements 212 from the static devices in the calibration area 302;

2) Outputs of the localization engine 214 for static devices;

3) Outputs of the digital twin’s 206 network simulator 208;

4) Estimation of crowdedness from the semantic scene simulator 216 (e.g., planned event attendance obtained via the event subscriber 218 from the city data services 220);

5) Outputs of the localization for the mobile agent 202 as ground truth (if available).

It should be noted that ML-based crowd estimators have already been proposed in prior art, particularly in Fang-Jing Wu and Gürkan Solmaz. “Crowdestimator: Approximating crowd sizes with multi-modal data for internet-of-things services.” In Proceedings of the 16th Annual International Conference on Mobile Systems, Applications, and Services, pp. 337-349. 2018. However, the existing prior art solutions do not consider the digital twin environment 206, which may provide additional simulation data that can be either fully synthetic or outcomes of adjusted simulations using real-world data. As such, compared to prior art, the ML-based crowd estimator 210 according to embodiments of the present invention has the advantage that real measurements (wireless and ground-truth) as well as the simulated measurements can be merged together for training the crowd estimator 210. Later, the crowd estimator 210 is thus able to make estimates using either only wireless measurements or both wireless and simulated measurements, without the need for ground truth from a mobile agent 202. This way, the crowd estimator 210 provides additional advantages such as long-term accurate crowd estimation without using cameras or any personal information (e.g., using only fully-anonymized MAC addresses) and reduced energy costs, in particular the networking and computation costs that would otherwise result from the constant transfer of videos from the environment to the server side and their processing there.
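The merging of real and simulated measurements for training, followed by wireless-only estimation, can be illustrated with a deliberately simple model. The sketch below fits a weighted straight line from detected device counts to people counts; the linear model, the lower weight given to simulated samples, and all numbers are assumptions for illustration and do not represent the ML model of the embodiments.

```python
def fit_weighted_line(samples):
    """Weighted least-squares fit of target = slope * feature + intercept.

    samples: list of (device_count, people_count, weight) tuples.
    """
    sw = sum(w for _, _, w in samples)
    mx = sum(w * x for x, _, w in samples) / sw
    my = sum(w * y for _, y, w in samples) / sw
    sxx = sum(w * (x - mx) ** 2 for x, _, w in samples)
    sxy = sum(w * (x - mx) * (y - my) for x, y, w in samples)
    slope = sxy / sxx
    return slope, my - slope * mx


# Training phase: real samples use ground truth from the mobile agent;
# simulated samples come from the network simulator (all values made up).
real_samples = [(10, 25, 1.0), (20, 45, 1.0)]
simulated_samples = [(30, 65, 0.5), (40, 85, 0.5)]
slope, intercept = fit_weighted_line(real_samples + simulated_samples)


def estimate_people(device_count):
    """Operational phase: estimate from the wireless measurement only."""
    return slope * device_count + intercept


people_at_15_devices = estimate_people(15)
```

Because all made-up samples lie on the line people = 2 · devices + 5, the fit recovers that line and the wireless-only estimate for 15 detected devices is 35 people. In practice the relation would be noisier and a richer model with more input features would be used.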

Smart Cities is the main application domain of the crowd estimation method/apparatus due to its benefits in areas such as city management, tourism services, epidemics monitoring, and others. The method/apparatus can operate in a smart city as exemplarily illustrated in Fig. 5, where various sensors 252, 254 are statically deployed in the environment and an urban digital twin platform supports the city operation. A mobile intelligent agent 202 (or multiple of them) acting as calibration planner and equipped with more accurate sensing capabilities (as compared to the less reliable sensors 252, 254) can move around the critical hotspots in the smart city for the purpose of calibrating the wireless/sensing measurements in that environment. The calibration may take place when the intelligent agent 202 is in the vicinity of a respective one of the static sensors 252, 254. After an area gets calibrated and starts making accurate measurements of the environment, the digital twin 206 notifies the intelligent agent 202 about the status, and the intelligent agent 202 re-visits the existing plan and updates it when necessary.

According to embodiments of the invention, the proposed method/apparatus for long-term accurate crowd estimation may relate to one or more of the following aspects:

1) Network simulator in the Digital Twin adds accuracy to crowd estimation through ML data augmentation with synthetic data

2) Intelligent agent enables network simulator through the environmental information sharing with digital twin and plans its tasks efficiently based on ML-based crowd estimation results from digital twin

3) Combination of digital twin and intelligent agent enhances the crowd estimation accuracy and privacy.

With reference to Fig. 3, in a concrete implementation, the present invention provides a method/apparatus for long-term accurate crowd estimation, comprising one or more of the following steps/components:

Intelligent agent 202:

1. Intelligent crowd estimation training agent 230 collects and processes environmental data

2. Image processing 242 processes environmental data for vehicle recognition 232, people recognition 234, object recognition 236 inputs

3. The inputs 232, 234, 236 are sent to digital twin platform 250

Large-scale sensors 204:

1. Wireless sensors 252 and assistant sensors 254 collect wireless and auxiliary data

2. Wireless and auxiliary data are sent to digital twin platform 250

Digital Twin 206:

1. Digital twin platform 250 receives inputs 252, 232, 234, 236, 254

2. Localization engine 214 applies localization on wireless data and sends result to network simulator 208

3. Network simulator 208 receives outputs from 232, 234, 236, 214 and simulates network

4. Event subscriber 218 collects information from city data services 220

5. Semantic scene simulator 216 receives events from event subscriber 218 and outputs from 232, 234, 236

6. Wireless data 212, localization data (from 214), and outputs from network simulator 208 and semantic scene simulator 216 are sent to ML-based crowd estimator 210

7. Crowd estimation outputs from 210 are sent to intelligent crowd estimation training agent 230

Intelligent agent 202:

1. Intelligent crowd estimation training agent 230 receives crowd estimator and scene simulation data from digital twin platform 250

2. Flexible calibration planner 240 plans the tasks of intelligent agent 202

3. Vehicle route planner 238 decides on movement routes of intelligent agent 202 (e.g. an autonomous street car equipped with accurate sensing capabilities).

The main technical advantages in application of embodiments of the present invention in smart cities are as follows:

1) Improved crowd estimation accuracy: The ML-Based Crowd Estimator 210 is supported by the Intelligent Agent 202 to learn the crowdedness for any given time with wireless and other information.

2) More robust crowd estimation for long-term deployments.

3) Improved personal privacy: The deployed sensors 204 as well as the intelligent agent 202 would not need to send any personal information to the digital twin side 206, whereas they will share their overall measurements (e.g., number of people in a city square). Furthermore, the operation of the intelligent agent 202 with high-accuracy sensors is considered temporary and once the crowd estimation is calibrated, the intelligent agent 202 does not need to be in use in the given environment. In particular, the solution does not depend on individual mobility tracking.

4) Improved use of equipment: The sensing and computation in the intelligent agent 202 are considered more advanced but more costly compared to wireless sensing. For instance, a wireless sensing device may cost very little, so a city can afford to buy thousands of these devices. On the other hand, an autonomous vehicle with high-accuracy sensors would cost considerably more, and only a limited number of such vehicles will be available. Moreover, the same vehicle might be shared by multiple cities.

5) Improved energy efficiency: The solutions existing heretofore, which run cameras everywhere, cause much computation (video processing) and communication (transferring videos for server-side deep learning), which leads to high energy consumption. In contrast, the present invention proposes using a limited number of cameras temporarily and relying on wireless sensing for the overwhelming majority of the operation duration. The wireless devices would consume less energy, and energy harvesting mechanisms can be used to further minimize the energy consumption. On the other hand, the digital twin 206 might still cause considerable energy consumption, especially during the calibration/training period. However, this consumption would apply only for a short training period needed to learn the environment characteristics, as opposed to long-term energy consumption throughout the whole operation period for crowd estimation.

Many modifications and other embodiments of the invention set forth herein will come to mind to the one skilled in the art to which the invention pertains having the benefit of the teachings presented in the foregoing description and the associated drawings. Therefore, it is to be understood that the invention is not to be limited to the specific embodiments disclosed and that modifications and other embodiments are intended to be included within the scope of the appended claims. Although specific terms are employed herein, they are used in a generic and descriptive sense only and not for purposes of limitation.

Claims

1. A computer-implemented method for crowd estimation in an environment, the method comprising: receiving, from a mobile intelligent agent (202) having accurate sensing capabilities, environmental data and processing the received environmental data for vehicle, people, and/or object recognition; using wireless sensors (252) deployed in the environment to collect wireless signals from the environment; training, during a training phase, a machine learning model to learn correlations between the collected environmental data and the collected wireless signals; and performing, during an operational phase, crowd estimation based on collected wireless signals only and by calibrating the collected wireless signals using the trained machine learning model.

2. The method according to claim 1, further comprising: generating, by a digital twin platform (250) based on the results of the environmental data processing together with the collected wireless signals, a simulation of the wireless networking characteristics of the environment.

3. The method according to claim 2, further comprising: dividing the environment into a number of calibration areas (302); performing, by a ML-based crowd estimator (210) of the digital twin platform (250) based at least on the simulation of the wireless networking characteristics of the environment, a crowd estimation for each calibration area (302) and determining for each crowd estimation a value indicative of a certainty of the respective crowd estimation and a value indicative of a temporality of the calibration used for the respective crowd estimation.

4. The method according to claim 3, further comprising: using, by the mobile intelligent agent (202), the results of the crowd estimations performed by the ML-based crowd estimator (210) to adapt the collecting of environmental data.

5. The method according to claim 3 or 4, further comprising: simulating, by a network simulator (208) of the digital twin platform (250), the environment assuming each person would have mobile device(s); collecting simulated wireless measurements of a respective calibration area (302); matching the wireless signals collected from the wireless sensors (252) with the simulated wireless measurements; adjusting the simulation of the environment based on the matched wireless measurements; estimating the differences between real measurements and adjusted simulated measurements; and outputting the simulated wireless measurements and the estimated differences as inputs for the ML-based crowd estimator (210).

6. The method according to any of claims 3 to 5, wherein the ML-based crowd estimator (210) of the digital twin platform (250), for performing crowd estimations, takes into account estimations of crowdedness from a semantic scene simulator (216) of the digital twin platform (250) that simulates the environment using semantic information regarding events taking place within the environment.

7. The method according to any of claims 3 to 6, further comprising: using, by a calibration planner (240) of the mobile intelligent agent (202), the crowd estimations of the digital twin platform’s (250) ML-based crowd estimator (210) to devise adapted movements of the mobile intelligent agent (202) within the environment in order to collect environmental data from within specific calibration areas (302).

8. The method according to claim 7, wherein the calibration planner (240) is realized by means of a Q-learning-based implementation.

9. The method according to claim 7 or 8, wherein the calibration planner (240), for devising adapted movements of the mobile intelligent agent (202) within the environment, takes into account travel times between calibration areas (302) and cross-correlation factors between calibration areas (302).

10. The method according to any of claims 2 to 8, further comprising: using assistant sensors (254) deployed in the environment including at least one of humidity, noise, and CO₂ sensors to collect auxiliary sensing information from the environment; and using, by the digital twin platform (250) the auxiliary sensing information together with the collected wireless signals for the simulation of the wireless networking characteristics of the environment.

11. An apparatus for crowd estimation in an environment, in particular for execution of a method according to any of claims 1 to 10, the apparatus comprising one or more processors which, alone or in combination, are configured to provide for execution of receiving, from a mobile intelligent agent (202) having accurate sensing capabilities, environmental data and processing the received environmental data for vehicle, people, and/or object recognition, training, during a training phase, a machine learning model to learn correlations between the collected environmental data and wireless signals collected from the environment via wireless sensors (252) deployed in the environment, and performing, during an operational phase, crowd estimation based on collected wireless signals only and by calibrating the collected wireless signals using the trained machine learning model.

12. The apparatus according to claim 11, further comprising: a digital twin platform (250) configured to generate, based on the results of the environmental data processing together with the collected wireless signals, a simulation of the wireless networking characteristics of the environment.

13. The apparatus according to claim 12, wherein the digital twin platform (250) comprises a ML-based crowd estimator (210) configured to divide the environment into a number of calibration areas (302), and perform, based at least on the simulation of the wireless networking characteristics of the environment, a crowd estimation for each calibration area (302) and determine for each crowd estimation a value indicative of a certainty of the respective crowd estimation and a value indicative of a temporality of the calibration used for the respective crowd estimation.

14. The apparatus according to claim 13, wherein the digital twin platform (250) comprises a network simulator (208) configured to simulate the environment assuming each person would have mobile device(s); collect simulated wireless measurements of a respective calibration area (302); match the wireless signals collected from the wireless sensors (252) with the simulated wireless measurements; adjust the simulation of the environment based on the matched wireless measurements; estimate the differences between real measurements and adjusted simulated measurements; and output the simulated wireless measurements and the estimated differences as inputs for the ML-based crowd estimator (210).

15. The apparatus according to claim 13 or 14, further configured to provide the crowd estimations of the digital twin platform’s (250) ML-based crowd estimator (210) to a calibration planner (240) of the mobile intelligent agent (202) that uses the crowd estimations to devise adapted movements of the mobile intelligent agent (202) within the environment for collecting environmental data from within specific calibration areas (302).