WO2021255640A1 - Deep-learning-based computer vision method and system for beam forming - Google Patents

Deep-learning-based computer vision method and system for beam forming

Info

Publication number
WO2021255640A1
Authority
WO
WIPO (PCT)
Prior art keywords
user terminal
target user
future
deep
network
Application number
PCT/IB2021/055268
Other languages
French (fr)
Inventor
Mohamed-Slim Alouini
Yu Tian
Gaofeng PAN
Original Assignee
King Abdullah University Of Science And Technology
Application filed by King Abdullah University Of Science And Technology
Publication of WO2021255640A1

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04BTRANSMISSION
    • H04B7/00Radio transmission systems, i.e. using radiation field
    • H04B7/02Diversity systems; Multi-antenna system, i.e. transmission or reception using multiple antennas
    • H04B7/04Diversity systems; Multi-antenna system, i.e. transmission or reception using multiple antennas using two or more spaced independent antennas
    • H04B7/06Diversity systems; Multi-antenna system, i.e. transmission or reception using multiple antennas using two or more spaced independent antennas at the transmitting station
    • H04B7/0686Hybrid systems, i.e. switching and simultaneous transmission
    • H04B7/0695Hybrid systems, i.e. switching and simultaneous transmission using beam selection

Definitions

  • the traditional algorithms adopted by the wireless communication systems depend on traditional channel/network state estimation methods to grab the CSI and network state information, which unavoidably suffer from time delays and/or feedback errors, resulting in low efficiency or even wrong decisions.
  • By using one or more of the DL-based CV techniques, the system 100 shown in Figure 1 can accurately and efficiently extract the static and dynamic system information from the recorded visual data, bringing vital benefits to the design and optimization of the wireless communication system 160.
  • Applying DL-based CV to wireless communications has two aspects: the datasets that are used for training the system 100, and to what other systems the results of the DL-based CV system are applied, i.e., their applications.
  • building datasets is a necessary step as the DL system is data-hungry.
  • Some authors [1] proposed a parametric, systematic, scalable dataset framework, called Vision-Wireless (ViWi), for these systems.
  • the authors utilized a DL-based CV framework to build the first-version dataset containing four scenarios with different camera distributions (co-located and distributed) and views (blocked and direct). These scenarios were based on a millimeter wave (mmWave) MIMO wireless communication system.
  • ViWi-BT: ViWi Vision-Aided Millimeter-Wave Beam Tracking
  • ray-tracing data provides path parameters such as received power, time of arrival, angle of departure, angle of arrival, line-of-sight ray status, and ray phase, while GPS user info data has line-of-sight (LOS) status, channel validity information, the number of the TX in the vehicle, and the 3D coordinates.
  • a framework to implement beam selection in mmWave communication systems by leveraging environmental information was presented by [4].
  • the authors used the images with different perspectives captured by one camera to construct a three-dimensional (3D) scene and generate corresponding point cloud data. They built a model based on a 3D CNN to learn the wireless channel from the point cloud data and predict the optimal beam.
  • the authors in [5] proposed a modified ResNet18 model to conduct beam and blockage prediction, based on the images and channel information.
  • the authors in [2] provided a baseline method based on Gated Recurrent Units (GRUs) without the images, only the beam indices.
  • another approach uses a neural network containing CNNs and an RNN-based recurrent prediction network to predict dynamic link blockages using red, green, blue (RGB) images and beamforming vectors provided by the extended ViWi-BT dataset.
  • VVD: a CNN-based framework
  • convLSTM: convolutional long short-term memory
  • In the following, MIMO and beamforming assisted by the DL-based CV system 100 are discussed.
  • MmWave communication is a promising technique in the fifth-generation communication system due to its broad available bandwidth and ultra-high data transmission rate.
  • MIMO and beamforming are widely used in mmWave communication systems and are implemented in a large antenna array to achieve the required high-power gain and direction.
  • the classic beamforming and beam tracking algorithms suffer a common disadvantage: their complexity increases dramatically with the number of antennas, resulting in substantial computational overhead.
  • the DL-based CV system addresses this overhead issue.
  • a typical scenario encountered by a wireless communication system 160 deployed in a city is the presence of high-rise buildings 202 and 204, which are located on the same street 206 and separated by a certain distance, for example, 60 meters.
  • the base station (BS) 168 forms a MIMO beam 170-I for each target user 180-I (here I takes the value 2, but it may take any whole value) moving along the street 206. Therefore, the beam 170-I’s direction must be dynamically adjusted by the BS to catch the corresponding target mobile user 180-I.
  • a target user terminal 180-2 may be blocked at some moments, for example, at t_8 in Figure 2, by an obstacle 182 (a truck in this case, but other obstacles may block the direct beam 170-2) and then the beam cannot directly reach the target user 180-2.
  • the DL-based CV system 100 is implemented herein to detect that the beam 170-2 is blocked, to determine/predict the reflected beam 170-2’, and to help the communication system 160 implement the new beam.
  • the symbol Θ in the prediction function represents the set of parameters of the DL model, which is obtained by training the model with a training set.
  • the training set includes labelled sequences, where each pair in the set includes an observed sequence S_u and five groundtruth future beam indices g_u.
  • the goal of this embodiment is to get the prediction function which can maximize the joint success probability of all data samples in the dataset D.
  • the objective function used to maximize the joint success probability is expressed in this embodiment as $f_{\Theta^*} = \operatorname{argmax}_{f_{\Theta}} \prod_{u=1}^{|D|} \mathbb{P}\left[\hat{g}_u = g_u \mid S_u[t]\right]$, where each success probability only relies on its observed sequence $S_u[t]$. Other objective functions may be used.
  • the DL network 300 illustrated in Figure 3 is used. Those skilled in the art would understand that this framework is just an example and other configurations may be used for the DL network with similar results.
  • the DL network 300 further includes a feature-fusion module (FFM) 306 and a predictive network 308.
  • Because the internal structures of the ResNet block, the ResNext block, and the FFM module are known, their description is omitted; the reader can obtain this information from, for example, K. He et al., “Deep residual learning for image recognition,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Las Vegas, NV, USA, Jun. 2016, pp. 770-778, and K. Hara et al., “Can spatiotemporal 3D CNNs retrace the history of 2D CNNs and ImageNet?” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Salt Lake City, UT, USA, Jun. 2018, pp.
  • the ResNet block 302 includes several residual blocks 400-J, where J is a whole number, as illustrated in Figure 4.
  • Each block 400-J includes two or more convolutional layers and superimposes its input onto its output through an identity mapping. It can efficiently address the vanishing gradient issue caused by the rising number of convolutional layers. If a specific number of such blocks are concatenated, as depicted in Figure 4, the ResNet block 302 can reach as many as 152 layers.
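  • For illustration, a minimal sketch of such a residual block in PyTorch (the framework named later for the experiments) is shown below; the channel count and number of blocks are illustrative assumptions, not the exact configuration of the blocks 400-J.

    import torch
    import torch.nn as nn

    class ResidualBlock(nn.Module):
        """Minimal residual block: two convolutional layers plus an identity skip connection."""
        def __init__(self, channels: int = 64):
            super().__init__()
            self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
            self.bn1 = nn.BatchNorm2d(channels)
            self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
            self.bn2 = nn.BatchNorm2d(channels)
            self.relu = nn.ReLU(inplace=True)

        def forward(self, x):
            identity = x                              # identity mapping of the input
            out = self.relu(self.bn1(self.conv1(x)))
            out = self.bn2(self.conv2(out))
            return self.relu(out + identity)          # superimpose the input onto the output

    # Concatenating several such blocks yields a deep network that still trains stably.
    blocks = nn.Sequential(*[ResidualBlock(64) for _ in range(4)])
    features = blocks(torch.randn(1, 64, 56, 56))     # output shape (1, 64, 56, 56)
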
  • Figure 5 presents the structure of a ResNext block 500, which is an improved version of the ResNet block 302 that adds a ‘next’ dimension, also called “Cardinality.”
  • the ResNext block 500 sums the outputs of K parallel convolutional layer paths 502-K, which share the same topology, and inherits the residual structure in the combination. As K diversities are achieved by the K paths, this block can focus on more than one specific feature representation of the images.
  • For the 3D ResNext block 304, a similar structure can be observed, but with 3D convolutional layers instead of two-dimensional (2D) ones.
  • the 3D convolutional layer is designed to capture spatiotemporal 3D features from raw videos.
  • the ResNet and 3D ResNext blocks have been widely used as feature extractors for their powerful feature-representation abilities. If they are used in a DL network directly as in Figure 3, however, the training time will become extremely long, and many computational resources will be occupied due to the large number of layers. Therefore, it is customary to apply a pre-trained ResNet on the ImageNet dataset to extract visual features from the images and to apply a 3D ResNext on the Kinetics dataset to extract spatiotemporal features from videos. These features are then fed to the DL network as inputs.
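  • A sketch of this feature-extraction stage is given below, assuming torchvision’s ImageNet-pretrained ResNet152 (torchvision ≥ 0.13) for the per-image visual features; a Kinetics-pretrained 3D ResNext101 is not bundled with torchvision, so it is passed in as an externally loaded model.

    import torch
    import torch.nn as nn
    from torchvision import models

    # ImageNet-pretrained ResNet152 with the classification head removed.
    resnet152 = models.resnet152(weights=models.ResNet152_Weights.IMAGENET1K_V1)
    visual_extractor = nn.Sequential(*list(resnet152.children())[:-1]).eval()

    @torch.no_grad()
    def extract_visual_features(frames: torch.Tensor) -> torch.Tensor:
        """frames: (n, 3, 224, 224) images -> (n, 2048) visual feature vectors."""
        return visual_extractor(frames).flatten(1)

    @torch.no_grad()
    def extract_motion_features(clip: torch.Tensor, motion_extractor: nn.Module) -> torch.Tensor:
        """clip: (1, 3, n, 112, 112) video tensor -> spatiotemporal feature vector.
        motion_extractor is assumed to be a Kinetics-pretrained 3D ResNext101 loaded
        from a third-party implementation; it is not part of torchvision."""
        return motion_extractor(clip).flatten(1)
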
  • the FFM module 306 includes, as shown in Figure 3, two long short-term memory (LSTM) networks 310, one of which is configured to receive the data from the ResNet module 302 and the other configured to receive the data from the 3D ResNext module 304.
  • the LSTM network 310, see [10], is configured for tasks that contain time-series data, such as prediction, speech recognition, text generation, etc. Hence, it is a suitable candidate for the predictive network 308.
  • the LSTM network 310 which is shown in more detail in Figures 6 and 7, includes several LSTM cells 800-L, as illustrated in Figure 8, where L is a whole number equal to or larger than the number of input images n.
  • the following data is used as the input for the LSTM cell 800-L in Figure 8: an event 802 (which describes the current state), previous long-term memory 804 (which describes the cell state), and previous short-term memory 806 (which describes the hidden state).
  • These inputs of the LSTM cell 800-L are provided to the learn gate 810, forget gate 812, remember gate 814, and output gate 816 and are employed to explore the information from the inputs.
  • the LSTM cell 800-L outputs new long-term memory data 820 and short-term memory data 822, in which the latter is also regarded as a prediction.
  • the data 820 and 822 is then provided to a next LSTM cell as the initial state 610.
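  • The data flow through one such cell can be sketched with PyTorch’s built-in LSTM cell, where the ‘event’ is the cell input, the hidden state plays the role of the short-term memory, and the cell state plays the role of the long-term memory; the dimensions are illustrative assumptions.

    import torch
    import torch.nn as nn

    feature_dim, hidden_dim = 463, 463                # illustrative sizes
    cell = nn.LSTMCell(input_size=feature_dim, hidden_size=hidden_dim)

    event = torch.randn(1, feature_dim)               # current 'event' (input 802)
    short_term = torch.zeros(1, hidden_dim)           # previous short-term memory / hidden state (806)
    long_term = torch.zeros(1, hidden_dim)            # previous long-term memory / cell state (804)

    # The gates inside the cell decide what to learn, forget, remember, and output.
    new_short_term, new_long_term = cell(event, (short_term, long_term))

    # The new states (822, 820) are handed to the next cell as its initial state;
    # the new short-term memory also serves as this cell's prediction output.
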
  • each LSTM cell utilizes the hidden and cell states from two neighboring cells 800-(L-1) and 800-(L-2), which are in the left and below positions in the mesh in Figure 7, and its initial state 610 is delivered to its neighboring cells in the right and top positions.
  • the number of predictions is equal to the number of rows of the mesh.
  • each cell receives an input from a previous cell (i.e., the state 610 of the previous cell) and the last four cells, i.e., cells 800-9 to 800-12, also receive the output of the previous cell, e.g., cell 800-9 receives the output from cell 800-8.
  • the FFM module 306 in Figure 3 includes, in addition to the two LSTM networks 310, a cross-gating block 312.
  • the features 303 from the ResNet block 302 and the features 305 from the 3D ResNext block 304 are aggregated by the LSTM networks 310 and then high-level merged features 320 are obtained.
  • the cross-gating block 312 can make full use of the related semantic information between these two kinds of features 303 and 305 by multiplication 314 and summation 316 operations. Then, the merged features 320 can be obtained through a linear transformation 318.
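  • A minimal sketch of such a cross-gating fusion is given below; it follows the description above (the two LSTM-aggregated feature streams gate each other through element-wise multiplication and summation, followed by a linear transformation), with dimensions and gating layout chosen as illustrative assumptions rather than the exact configuration of the FFM 306.

    import torch
    import torch.nn as nn

    class CrossGatingFusion(nn.Module):
        """Fuse LSTM-aggregated visual and motion features via cross gating."""
        def __init__(self, visual_dim=2048, motion_dim=8192, hidden_dim=512, out_dim=463):
            super().__init__()
            self.visual_lstm = nn.LSTM(visual_dim, hidden_dim, batch_first=True)
            self.motion_lstm = nn.LSTM(motion_dim, hidden_dim, batch_first=True)
            self.gate_v = nn.Linear(hidden_dim, hidden_dim)   # gates the motion stream with the visual one
            self.gate_m = nn.Linear(hidden_dim, hidden_dim)   # gates the visual stream with the motion one
            self.project = nn.Linear(hidden_dim, out_dim)     # final linear transformation (cf. 318)

        def forward(self, visual_seq, motion_seq):
            # visual_seq: (B, T, 2048) per-image features; motion_seq: (B, T', 8192) clip features
            _, (v, _) = self.visual_lstm(visual_seq)          # aggregated visual features
            _, (m, _) = self.motion_lstm(motion_seq)          # aggregated motion features
            v, m = v[-1], m[-1]                               # (B, hidden_dim)
            gated = v * torch.sigmoid(self.gate_m(m)) + m * torch.sigmoid(self.gate_v(v))
            return self.project(gated)                        # merged feature vector (cf. 320)

    fusion = CrossGatingFusion()
    merged = fusion(torch.randn(1, 8, 2048), torch.randn(1, 1, 8192))   # shape (1, 463)
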
  • Because the sequence of images is equivalent to a video clip, the images contain motion information, which is helpful for the beam prediction.
  • The visual information in each image (e.g., the locations of the various user terminals, targets, and obstacles), the motion of these elements, and the blockage information can be extracted from the images T1 to T8.
  • the pre-trained 3D ResNext with 101 layers (3D ResNext101) is adopted to extract the motion features 305 (e.g., speed, direction, acceleration, etc.) and the pre-trained ResNet with 152 layers (ResNet152) is used to extract the visual features 303 (e.g., locations of the user terminals, targets, obstacles, heights of the targets, etc.). These features 303 and 305 are then merged as a vector 320 through the FFM 306 and sent to the predictive network 308.
  • the initial state 610 is set as a vector having only zeros for the first cell.
  • the initial state 610 includes the long-term 804 and the short-term memory 806.
  • the long-term memory includes information of previous time slots like beams, locations and speeds.
  • the short-term memory includes the beam information of this time slot.
  • the embedded vectors b_i represent a mapping of a constant (a beam index) to a vector and they are indicative of the direction of each beam generated by the base station.
  • the embedded vector is utilized to represent the beam index.
  • the embedded beam vector b_i and the merged feature vector 320 are first transformed to the same shape and then summed up as the ‘event’ 802.
  • Based on (1) the event 802 and (2) the short-term and long-term memories 806 and 804, respectively, obtained from the previous LSTM cell 800-(L-1), the current LSTM cell 800-L predicts a future output vector o_j, whose index of the maximum element is the predicted beam index.
  • all the LSTM cells share the same merged features 320.
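  • The mapping from a beam index to an embedded vector, its combination with the merged features into the ‘event’, and the recovery of a predicted beam index from the output vector can be sketched as follows; the beam-codebook size and dimensions are assumptions for illustration.

    import torch
    import torch.nn as nn

    num_beams, embed_dim, hidden_dim = 128, 463, 463        # assumed codebook size and dimensions

    beam_embedding = nn.Embedding(num_beams, embed_dim)     # beam index -> embedded vector b_i
    to_event_shape = nn.Linear(463, embed_dim)              # merged features -> same shape as b_i
    lstm_cell = nn.LSTMCell(embed_dim, hidden_dim)
    output_head = nn.Linear(hidden_dim, num_beams)          # hidden state -> output vector o_j

    merged_features = torch.randn(1, 463)                   # vector 320 from the FFM
    prev_beam_index = torch.tensor([17])                    # observed beam index (example value)
    h = torch.zeros(1, hidden_dim)                          # previous short-term memory
    c = torch.zeros(1, hidden_dim)                          # previous long-term memory

    event = beam_embedding(prev_beam_index) + to_event_shape(merged_features)   # 'event' 802
    h, c = lstm_cell(event, (h, c))
    o_j = output_head(h)                                    # output vector o_j
    predicted_beam = o_j.argmax(dim=-1)                     # index of the maximum element
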
  • a first method is using the 1D LSTM network 600 of Figure 6.
  • the LSTM cell 800-L is recursively used 12 times in this method.
  • the cell at the kth moment is denoted as the ‘kth LSTM cell’.
  • the flow of this first method is as follows.
  • the features 303 and 305 are merged through the FFM module 306 in step 902 and the output vector 320 is generated.
  • In step 904, the merged features and the previous beam index are fed to each LSTM cell 800-L as an input, as shown in Figure 6.
  • In step 906, the embedded vectors b_i of the first 8 beam indices go through the first to the eighth LSTM cells to update the hidden states. The last four cells do not receive as input the embedded vectors that correspond to actual beams. All 12 LSTM cells generate 12 corresponding output vectors o_j.
  • the embedded vectors b_i for the first 8 cells are obtained from the communication network 160 (i.e., they represent actual beams sent to the target user terminal) while the last 4 cells receive embedded vectors that are calculated from the corresponding output vectors o_j of the previous corresponding cell, i.e., the embedded vector b_9 fed at cell 800-9 is calculated from the output o_9 of the previous cell 800-8, as indicated by arrow 620 in Figure 6.
  • the 12 output vectors are then used in step 908 to calculate the training loss 1000 with the ground truth and train the network 100, as schematically illustrated in Figure 10. Any loss function 1002 may be used.
  • During testing, step 906 is not applicable and instead the following two steps are implemented: the embedded vectors b_i of the first eight beam indices go through the first to eighth LSTM cells and update the hidden states, as shown in Figure 11, and then the eighth to twelfth LSTM cells are used to predict the future beam indices, i.e., l_9 to l_13, which are obtained by acquiring the indices of the maximum element in these output vectors (i.e., using corresponding Argmax layers 1110).
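  • A sketch of this testing flow is given below; the module arguments are assumed to be the embedding, shape-alignment, LSTM-cell, and output-head layers of the trained 1D LSTM predictive network (as in the sketch above), and the exact cell count is simplified.

    import torch

    def predict_future_beams(observed_beams, merged_features, beam_embedding,
                             to_event_shape, lstm_cell, output_head, n_future=5):
        """observed_beams: 1-D tensor of the first eight observed beam indices."""
        h = torch.zeros(1, lstm_cell.hidden_size)
        c = torch.zeros(1, lstm_cell.hidden_size)
        # The eight observed (embedded) beam indices update the hidden states;
        # the output after the last observed beam already predicts the first future index.
        for idx in observed_beams:
            event = beam_embedding(idx.view(1)) + to_event_shape(merged_features)
            h, c = lstm_cell(event, (h, c))
        next_index = output_head(h).argmax(dim=-1)        # argmax layer over the output vector
        predictions = [int(next_index)]
        # Each predicted index is embedded and fed back as the input of the next cell.
        for _ in range(n_future - 1):
            event = beam_embedding(next_index) + to_event_shape(merged_features)
            h, c = lstm_cell(event, (h, c))
            next_index = output_head(h).argmax(dim=-1)
            predictions.append(int(next_index))
        return predictions
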
  • the second method uses a modified 1D LSTM network.
  • the first method essentially aims to predict the next beam index, as all of the first 12 beam indices are used as inputs during the training process.
  • Thus, the correctness of the previous prediction is important for the next prediction.
  • the second method modifies the first method so that the output vector of each of the last five LSTM cells 800-8 to 800-13 passes through a linear transformation module and is fed to the next cell as the embedded input b_i. In this way, only the first eight beam indices are used as inputs, and the training and testing steps can be the same.
  • a third method uses the 2D LSTM network 700 shown in Figure 7 as the predictive network.
  • the initial state 610, the merged feature vector 320, and the embedded vectors of the first eight beam indices b_1 to b_8 are input into the LSTM network 700, and then the network generates five output vectors o_9 to o_13.
  • the training process is the same as the testing one.
  • the three methods discussed above were tested with the ViWi-BT dataset. All the experiments were conducted in the framework of PyTorch on one NVIDIA V100 GPU.
  • the VIWI-BT dataset contains a training set with 281,100 samples, a validation set with 120,468 samples, and a test set with 10,000 samples. There are 13 pairs of consecutive beam indices and corresponding images of street views in each sample of the training and validation sets. Furthermore, the first eight pairs correspond to the observed beams for the target user terminal and the sequence of the images in which the target user terminal appears. The last five pairs are groundtruth data containing the future beams and images of the same user. In this experiment, the first eight pairs serve as the inputs of the DL network 100 to generate the predicted future five beam indices, which are compared with the groundtruth ones.
  • the pre-trained ResNet152 and 3D ResNext101 blocks were used to extract 2048-dimensional visual and 8192-dimensional motion features from the first eight images of each sample.
  • the merged features 320 are embedded as a 463-dimensional vector and fed to the predictive LSTM network 308.
  • the training process discussed above with regard to Figure 9 is implemented to train the proposed network.
  • the DL network 300 is optimized by the Adam optimizer.
  • the learning rate is set as 4 × 10^-4 at first and reduced by half every eight epochs.
  • the batch size is set as 256.
  • the cross-entropy loss is utilized for the loss function.
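  • A sketch of this training configuration with standard PyTorch components (Adam, initial learning rate 4 × 10^-4 halved every eight epochs, batch size 256, cross-entropy loss) is given below; model and train_dataset stand in for the DL network 300 and the ViWi-BT training set, and the tensor shapes are assumptions.

    import torch
    import torch.nn as nn
    from torch.utils.data import DataLoader

    def train(model, train_dataset, num_epochs=40, device="cuda"):
        loader = DataLoader(train_dataset, batch_size=256, shuffle=True)
        optimizer = torch.optim.Adam(model.parameters(), lr=4e-4)
        scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=8, gamma=0.5)  # halve every 8 epochs
        criterion = nn.CrossEntropyLoss()
        model.to(device).train()
        for epoch in range(num_epochs):
            for inputs, target_beams in loader:
                inputs, target_beams = inputs.to(device), target_beams.to(device)
                logits = model(inputs)                       # output vectors o_j, shape (B, m, num_beams)
                loss = criterion(logits.flatten(0, 1),       # (B * m, num_beams)
                                 target_beams.flatten())     # (B * m,) groundtruth indices
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
            scheduler.step()
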
  • the performances of the proposed three methods are evaluated on the validation set with the same metrics, which are the top-1 accuracy and exponential decay score.
  • the top-1 accuracy of $m$ future beams is expressed as $\mathrm{Acc}_m = \frac{1}{M}\sum_{i=1}^{M} \mathbb{1}\{\hat{\mathbf{l}}_i^{(m)} = \mathbf{g}_i^{(m)}\}$, where $M$ is the number of the samples in the validation set, $\mathbb{1}\{\cdot\}$ is the indicator function, and $\hat{\mathbf{l}}_i^{(m)}$ and $\mathbf{g}_i^{(m)}$ represent the predicted and groundtruth beam index vectors of the $i$-th sample with length $m$.
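  • A sketch of this metric, under the reading that the i-th sample counts as a success only when its whole length-m prediction matches the groundtruth vector:

    import torch

    def top1_accuracy(predicted: torch.Tensor, groundtruth: torch.Tensor) -> float:
        """predicted, groundtruth: (M, m) integer tensors of beam indices."""
        matches = (predicted == groundtruth).all(dim=1)     # indicator function per sample
        return matches.float().mean().item()                # average over the M samples

    # Example: three validation samples, m = 1 future beam each.
    pred = torch.tensor([[5], [12], [7]])
    gt = torch.tensor([[5], [11], [7]])
    print(top1_accuracy(pred, gt))   # 0.666...
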
  • the proposed novel methods outperform the baseline method on predicting ‘1 future beam’ because the location, blockage, and speed information of the target user terminal is extracted from the images and represented as motion and visual features to assist the prediction, and advanced LSTM networks are leveraged as the predictive networks.
  • the method with the 1D LSTM network shows the best beam prediction performance in the target mobile scenarios.
  • the other two exhibit extra linear transformation modules or more LSTM cells in their predictive networks and need to be trained with more data. Therefore, the performance degradation occurring on the predictions of ‘3 future beams’ and ‘5 future beams’ is caused by the small size of the training dataset.
  • the computational complexity here is measured by the running time of ‘1 future beam’ prediction, which exhibits the best performance and is more likely to be implemented in a practical wireless communication system.
  • the running time of the novel methods consists of the execution time of feature extraction, FFM, and predictive network. It takes 0.42 seconds for the pre-trained 3D ResNext101 and ResNet152 to extract features from each set of eight images.
  • the method with the 2D LSTM network exhibits the longest average running time due to its more complex structure shown in Figure 7.
  • the baseline method runs for the shortest time as it utilizes simple GRUs as the predictive network and abandons the image data.
  • the method with the 1D LSTM network shows the best predictive performance, but a moderate prediction time, 0.016 seconds.
  • the feature extraction takes a little longer time, which will cause latency issues, but these issues can be mitigated by employing more efficient CNNs.
  • only one new image is captured at each time instance, and the previously extracted spatiotemporal features can be merged with the new image to reduce the latency.
  • the 2D feature extraction only the new image needs to be processed at each time instance.
  • the Darknet-19 can achieve a speed of 171 frames per second (5.85 ms per image) to extract 2D features on an NVIDIA Titan X GPU.
  • 2D and 3D feature extraction can be conducted simultaneously, which means that the slower of the two determines the whole feature extraction time.
  • the RGB images and their corresponding beam indices can be obtained from cameras installed on the BS and the classic beamforming algorithm, respectively. After obtaining sufficient data for the training set, the proposed new network will be pre-trained on these data and then run in the processors of the BS for beamforming. At the beginning of the serving time, the first eight beam indices can be estimated by the classic beamforming algorithm. Then, the eight pairs of images and beam indices are sent to the processors for future beam predictions. Notably, after the first eight beams, the subsequent beams will be predicted by using previously-obtained images and beam indices. These predicted beam indices and their corresponding images can be added to the training set to enlarge the dataset and improve the performance.
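  • This serving procedure can be sketched as follows; the helper names (capture_image, classic_beamforming, form_beam, user_in_coverage, predict_future_beams) are assumptions standing in for the BS camera interface, the classic beam-selection routine, the beam-steering control, and the trained predictive network, not a defined API.

    def serve_target_user(bs, model, training_set, n_observed=8):
        """Bootstrap with the classic beamforming algorithm, then switch to DL prediction."""
        images, beams = [], []
        # At the beginning of the serving time, the first beams come from the classic algorithm.
        for _ in range(n_observed):
            images.append(bs.capture_image())
            beams.append(bs.classic_beamforming())
        while bs.user_in_coverage():
            # Predict the next beam from the last eight image / beam-index pairs.
            predicted = model.predict_future_beams(images[-n_observed:], beams[-n_observed:])
            bs.form_beam(predicted[0])
            images.append(bs.capture_image())
            beams.append(predicted[0])
            # Predicted pairs can be added to the training set to enlarge the dataset.
            training_set.append((images[-1], beams[-1]))
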
  • visual data 102 obtained at the BS 168 may contain the locations, number, and motion information 303, 305 of the user terminals that operate in the open area associated with the BS and associated objects. This information is extracted by the block 104 and can be used by the BS to adjust its transmitting power 1310, beam direction 1312 to save power consumption and reduce interference, and also to aid with the handover 1314 to or from another BS.
  • Figure 14 presents a real-life example: the motion information of the user terminals 180-I in the coverage area 1410 of a BS 168 can be utilized to forecast the future positions of these terminals and judge whether/when a terminal (e.g., terminal 180-2) goes out or a new one (e.g., 180-3) comes into its serving area. Then, the transmit power and beam direction for these terminals can be accurately assigned for these users who still stay in the coverage area, and channel resource allocation can be set up for the handover process to improve the utilization efficiency of the system resource.
  • the novel methods discussed above may be implemented for a Vehicle-to-Everything Communication system.
  • visual data captured by one vehicle can reveal its environments, such as traffic conditions, which can be used to set up links with neighboring terminals, access points, and vehicles. Therefore, traffic schedules and jam/accident alarms can be conducted for improved road safety, traffic efficiency, and energy savings.
  • the BS 168 is located on the side of the road 206 and utilizes the visual information obtained from the camera 162 to estimate a distance 1510 to a vehicle terminal 180-I, at various times, and adjusts its transmit power 1520 based on the actual distance between the BS and the vehicle terminal for power-saving and interference-reduction purposes.
  • the images or videos captured by the cameras 1530, which are located on the vehicles 180-I, can be used to detect a traffic jam or accident 1532, and the observed traffic information 1534 is then forwarded to a traffic control center (not shown), via the BS or other means.
  • the novel methods discussed above may be used for unmanned aerial vehicle (UAV)-Ground communications.
  • visual data captured by the UAV can be used to identify the locations and distribution of ground terminals, which can be utilized in power allocation, route/trajectory planning, etc.
  • a ground BS 168 communicates with several UAVs
  • visual data captured by the ground terminal can be used to define the serving range, allocate the channels/power, and so forth.
  • Figure 16 illustrates system 1600 in which a group of UAVs 1602-I communicates with a set of ground terminals 168.
  • the head UAV 1602 first takes an image of the whole area 1610 and detects all the terminals 168.
  • the serving area 1610 is divided into several subareas 1612-I.
  • Each UAV 1602-I serves one specific subarea 1612-I and designs a route schedule 1614-I according to the location information of these ground terminals obtained from the corresponding images.
  • visual data captured by satellites or airborne crafts can be applied to recognize and analyze the user’s distribution and schedule power budget/serving ranges to achieve optimal energy efficiency, thus achieving a “smart city.”
  • intelligent reflecting surfaces (IRSs) may be used with the novel methods discussed above. More specifically, implementing channel estimation and acquiring network state information at an IRS is impossible because there is no comparable calculation capacity and no radio frequency (RF) signal transmitting or receiving capability at the IRS. However, DL-based CV can offer useful information to compensate for this gap.
  • a proper control matrix can be optimally designed to accurately reflect the incident signals to the target destination by utilizing the visual data captured by the camera 1710 installed on the IRS 1702.
  • the visual data includes the locations, distances, and the number of terminals 180-I, as shown in Figure 17.
  • the method includes a step 1800 of receiving n images of a target user terminal and objects around the target user terminal, where n is a positive integer, a step 1802 of extracting visual features and motion features associated with the target user terminal and the objects around the target user terminal, a step 1804 of generating, in a given cell of a deep-learning network, an output vector o_j based on (1) merged features obtained from the visual features and the motion features, (2) an initial state received from a previous cell of the deep-learning network, and (3) a beam index b_i, and a step 1806 of calculating a future beam index l_m for the target user terminal for a future time.
  • the method may further include a step of applying an argmax layer to the output vector o_j to calculate the future beam index l_m, only for j equal to or larger than n, and/or a step of directing a future beam of the wireless communication network along a direction embedded into the future beam index l_m.
  • In one application, n = 8, i can take a value between 1 and 8, m can take a value between 9 and 13, and j takes a value between 2 and 13.
  • More generally, i takes a value up to and including n, and m takes a value of n+1 or larger.
  • the visual features include at least one of a position of the target user terminal, positions of other user terminals, a number of the other user terminals, a position of the objects around the target user terminal, a height of the objects around the target user terminal.
  • the motion features include at least one of a speed or acceleration or direction of the target user terminal, the other user terminals, and the objects around the target user terminal.
  • the deep-learning network includes plural given recurrent neural network cells, the first n-1 given recurrent neural network cells being configured to update a hidden state of the deep-learning network and the remaining given recurrent neural network cells being configured to generate the future beam indices l m .
  • the deep-learning network includes plural given recurrent neural network cells forming a one-dimensional network.
  • the deep-learning network includes plural given recurrent neural network cells forming a two-dimensional network.
  • the method may further include a step of calculating a power level for the future beam index l_m based on the merged features.
  • Computing device 1900 suitable for performing the activities described in the exemplary embodiments may include a server 1901.
  • a server 1901 may include a central processor (CPU) 1902 coupled to a random access memory (RAM) 1904 and to a read-only memory (ROM) 1906.
  • ROM 1906 may also be other types of storage media to store programs, such as programmable ROM (PROM), erasable PROM (EPROM), etc.
  • Processor 1902 may communicate with other internal and external components through input/output (I/O) circuitry 1908 and bussing 1910 to provide control signals and the like.
  • Processor 1902 carries out a variety of functions as are known in the art, as dictated by software and/or firmware instructions.
  • the server 1901 may be part of the communication network 160 or may just interact with such network.
  • Server 1901 may also include one or more data storage devices, including hard drives 1912, CD-ROM drives 1914 and other hardware capable of reading and/or storing information, such as DVD, etc.
  • software for carrying out the above-discussed steps may be stored and distributed on a CD-ROM or DVD 1916, a USB storage device 1918 or other form of media capable of portably storing information. These storage media may be inserted into, and read by, devices such as CD-ROM drive 1914, disk drive 1912, etc.
  • Server 1901 may be coupled to a display 1920, which may be any type of known display or presentation screen, such as LCD, plasma display, cathode ray tube (CRT), etc.
  • a user input interface 1922 is provided, including one or more user interface mechanisms such as a mouse, keyboard, microphone, touchpad, touch screen, voice-recognition system, etc.
  • Server 1901 may be coupled to other devices, such as base stations or other communication stations, etc.
  • the server may be part of a larger network configuration as in a global area network (GAN) such as the Internet 1928, which allows ultimate connection to various landline and/or mobile computing devices.
  • the disclosed embodiments provide a communication system enhanced with a DL-based CV system, which is capable of monitoring the various user terminals and adjusting the direction and/or power of the beams sent by the base station. It should be understood that this description is not intended to limit the invention. On the contrary, the embodiments are intended to cover alternatives, modifications and equivalents, which are included in the spirit and scope of the invention as defined by the appended claims.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Mobile Radio Communication Systems (AREA)

Abstract

A method for calculating beam indices in a wireless communication system includes receiving (1800) n images of a target user terminal (180-2) and objects (182, 204) around the target user terminal (180-2), extracting (1802) visual features (303) and motion features (305) associated with the target user terminal (180-2) and the objects (182, 204), generating (1804), in a given cell (800-I) of a deep-learning network (300), an output vector o_j based on (1) merged features (320), (2) an initial state (610) received from a previous cell (800-(I-1)) of the deep-learning network (300), and (3) a beam index b_i, and calculating (1806) a future beam index l_m for the target user terminal (180-2) for a future time. The future beam index l_m is calculated to avoid obstacles between an antenna (166) of the wireless communication system (160) and the target user terminal (180-2) at the future time.

Description

DEEP-LEARNING-BASED COMPUTER VISION METHOD AND
SYSTEM FOR BEAM FORMING
CROSS-REFERENCE TO RELATED APPLICATIONS
[0001] This application claims priority to U.S. Provisional Patent Application No. 63/039,805, filed on June 16, 2020, entitled “DEEP-LEARNING-BASED COMPUTER VISION METHOD FOR MILLIMETER WAVE MULTIPLE-INPUT MULTIPLE-OUTPUT BEAMFORMING,” the disclosure of which is incorporated herein by reference in its entirety.
BACKGROUND
TECHNICAL FIELD
[0002] Embodiments of the subject matter disclosed herein generally relate to a system and method for forming a beam in a wireless communication system, and more particularly, to using deep-learning-based computer vision techniques for determining characteristics of a target and beamforming in the wireless communication system.
DISCUSSION OF THE BACKGROUND
[0003] Deep-learning (DL) has seen great success in the computer vision (CV) field, and techniques related to these algorithms have been used in security, healthcare, remote sensing, and many other areas. DL networks include networks such as deep neural networks, deep belief networks, recurrent neural networks (RNNs), and convolutional neural networks (CNNs). Many DL networks with various structures have emerged with the availability of large image and video datasets and high-speed graphic processing units (GPUs). DL networks can achieve success in CV because they discover and integrate low-/middle-/high-level features in images and leverage them to accomplish specific tasks. DL can fulfill CV applications with remarkably high performance, such as semantic segmentation, image classification, and object detection/recognition. DL-based CV systems have therefore been widely utilized in public security, healthcare, and remote sensing, as such fields generate much visual data.
[0004] However, a DL-based CV system is rarely seen in the design and optimization of wireless communication systems, in which the researchers mainly focus on the transmission quality of the bits/packets, e.g., transmission rate, bit/packet error, traffic/user fairness, etc. via purely exploiting the information on the transmission behavior of radio frequency signals (e.g., the power, direction, phase, transmission duration, etc.), rather than making use of the geometry information of the surrounding space.
[0005] Thus, there is a need to integrate the DL-based CV capabilities into the existing telecommunication networks for more judiciously controlling the power, channel resource schedule, beamforming and other characteristics of the telecommunication system.
BRIEF SUMMARY OF THE INVENTION
[0006] According to an embodiment, there is a method for calculating beam indices in a wireless communication system, and the method includes receiving n images of a target user terminal and objects around the target user terminal, where n is a positive integer, extracting visual features and motion features associated with the target user terminal and the objects around the target user terminal, generating, in a given cell of a deep-learning network, an output vector o_j based on (1) merged features, obtained from the visual features and the motion features, (2) an initial state received from a previous cell of the deep-learning network, and (3) a beam index b_i, wherein j = i + 1 and i and j are positive integers, and calculating a future beam index l_m for the target user terminal for a future time, wherein m is a positive integer. The future beam index l_m is calculated to avoid obstacles between an antenna of the wireless communication system and the target user terminal at the future time.
[0007] According to another embodiment, there is a computing device for calculating beam indices in a wireless communication system. The computing device includes an interface configured to receive n images of a target user terminal and objects around the target user terminal, where n is a positive integer, and a processor connected to the interface. The processor is configured to extract visual features and motion features associated with the target user terminal and the objects around the target user terminal, generate, in a given cell of a deep-learning network, an output vector o_j based on (1) merged features, obtained from the visual features and the motion features, (2) an initial state received from a previous cell of the deep-learning network, and (3) a beam index b_i, wherein j = i + 1 and i and j are positive integers, and calculate a future beam index l_m for the target user terminal for a future time, wherein m is a positive integer. The future beam index l_m is calculated to avoid obstacles between an antenna of the wireless communication system and target user terminal at the future time.
[0008] According to another embodiment, there is a non-transitory computer readable medium including computer executable instructions, wherein the instructions, when executed by a computer, implement the method discussed above for calculating beam indices in a wireless communication system.
BRIEF DESCRIPTION OF THE DRAWINGS
[0009] For a more complete understanding of the present invention, reference is now made to the following descriptions taken in conjunction with the accompanying drawings, in which:
[0010] Figure 1 is a schematic diagram of a deep-learning-based computer vision system that is configured to optimize one or more parameters of a wireless communication network;
[0011] Figure 2 illustrates an actual scenario in which beams formed by the wireless communication network can reach a target user terminal directly and by reflection. The reflection happens when the line-of-sight path between the transmitting antenna and target user terminal is blocked;
[0012] Figure 3 illustrates a configuration of a deep-learning network used by the deep-learning based computer vision system;
[0013] Figure 4 illustrates a specific implementation of a layer of the deep-learning network for extracting visual features from various images;
[0014] Figure 5 illustrates another specific implementation of a layer of the deep learning network for extracting motion features from various images;
[0015] Figure 6 illustrates a first configuration of the deep-learning-based predictive network as shown in Figure 3;
[0016] Figure 7 illustrates a second configuration of the deep-learning-based predictive network as in Figure 3;
[0017] Figure 8 illustrates a configuration of a LSTM cell of the deep-learning-based predictive network in Figure 3;
[0018] Figure 9 is a flow chart of a method for training the deep-learning network;
[0019] Figure 10 is a schematic diagram of the steps taken to train the deep-learning network;
[0020] Figure 11 is a schematic diagram of the steps taken to calculate predicted beam indices for a given target terminal user;
[0021] Figures 12A and 12B compare the results for a baseline method and the method developed based on the deep-learning networks of the previous figures;
[0022] Figure 13 schematically illustrates 3 applications on how one or more parameters of a wireless communication cellular network can be adjusted based on the results of the deep-learning network;
[0023] Figure 14 illustrates a first possible application of the deep-learning-based computer vision to the wireless communication network;
[0024] Figure 15 illustrates a second possible application of the deep-learning-based computer vision to the wireless communication network;
[0025] Figure 16 illustrates a third possible application of the deep-learning-based computer vision to the wireless communication network;
[0026] Figure 17 illustrates a fourth possible application of the deep-learning-based computer vision to the wireless communication network;
[0027] Figure 18 is a flow chart of a method for calculating beam indices in the wireless communication system based on the deep-learning network; and
[0028] Figure 19 is a schematic diagram of a computing device in which the deep-learning network can be implemented.
DETAILED DESCRIPTION OF THE INVENTION
[0029] The following description of the embodiments refers to the accompanying drawings. The same reference numbers in different drawings identify the same or similar elements. The following detailed description does not limit the invention. Instead, the scope of the invention is defined by the appended claims. The following embodiments are discussed, for simplicity, with regard to a millimeter wave multiple-input multiple-output beamforming system. However, the embodiments to be discussed next are not limited to such a system, but may be applied to beamforming in other communication systems irrespective of the size of the wavelength.
[0030] Reference throughout the specification to “one embodiment” or “an embodiment” means that a particular feature, structure or characteristic described in connection with an embodiment is included in at least one embodiment of the subject matter disclosed. Thus, the appearance of the phrases “in one embodiment” or “in an embodiment” in various places throughout the specification is not necessarily referring to the same embodiment. Further, the particular features, structures or characteristics may be combined in any suitable manner in one or more embodiments.
[0031] According to an embodiment, a system for predicting future beam indices from previously observed beam indices and street view images using a DL network (for example, ResNet, 3-dimensional ResNext, and a long short-term memory network) is introduced. One or more cameras installed on or next to base stations of a wireless communication system provide the street view images. The DL algorithm detects one or more targets of interest and objects around these targets and predicts their future positions. Based on these predictions, the wireless communication system adjusts one or more of its characteristics, for example, beam power or beam direction, and forms one or more beams that are directed to the one or more targets of interest. The experimental results show that this method achieves much higher accuracy than a baseline method, and that visual data can significantly improve the performance of a multiple-input and multiple-output (MIMO) beamforming system. Note that the current telecommunication system uses one or more MIMO configurations for establishing communication between the base stations (e.g., cell towers) and the user terminals (i.e. , smart devices).
[0032] The embodiments discussed herein take advantage of the high-definition cameras that are installed almost everywhere because of their low cost and small size. In some public areas, cameras have long existed for monitoring purposes. Therefore, visual data can easily be obtained in wireless communication systems in real life. Useful information about the static system topology (including user terminals’ numbers, positions, distances among themselves, positions of objects around the user terminal, heights of the objects around the user terminal, etc.) and dynamic system changes (including moving speed, direction, and changes in the number of the terminals) can be recognized, estimated, and extracted from these multi-medium data via DL-based CV techniques. This information is then exploited for wireless communications to aid system design/optimization, such as resource scheduling and allocations, algorithm design, and more.
[0033] According to an embodiment illustrated in Figure 1, a DL-based CV system 100 that can be integrated with a wireless communication system 160 that provides wireless communications is configured to explore the useful information obtained/forecasted by the DL-based CV algorithm to facilitate the design of wireless communications via DL-based/traditional optimization methods. The DL-based CV system 100 receives at block 102 images and/or videos from existing video cameras. The video cameras 162 can be mounted on any structure associated with the communication system 160 or on any other structure adjacent to the communication system. In one embodiment, one or more of the cameras 162 may be installed on the targets themselves, for example, vehicles or user terminals (e.g., cell phones). The images are then processed by a processing block 104, based on DL-based CV algorithms 106, and various characteristics (e.g., object detection, scenario reconstruction, state prediction, etc.) are generated. These characteristics describe the various user terminals and other targets and their environment (e.g., building, roads, bridges, etc.). Then, static and dynamic information about the user terminals and their environment is collected at block 108. At block 110, the system 100 accesses or interfaces with the communication system 160 to get access to one or more layers 164 of the wireless communication algorithm. An optimization block 112 receives the static and dynamic information from block 108 and also specific parameters of the one or more layers 164 of the communication system 160 and generates an optimal solution 114 for the communication system 160. This solution, which may include, for example, the beam power, the direction of the beam, etc., is then fed back to the communication system 160 for beamforming according to this solution. Finally, the communication system 160 uses the MIMO antennas 166 at one or more base stations 168 to send one or more beams 170 to the communication device 180 of one or more users. Note that the beams 170 are directed to the actual position of the communication device 180 because of the static and dynamic information generated in block 108. The DL-based CV system may be implemented as a stand-alone unit as shown in Figure 1 or directly into the wireless communication system 160. The DL-based CV system 100 is now discussed in more detail.
[0034] In the following embodiment, the results of the DL-based CV system 100 are implemented into the wireless system 160 at one or more of the following layers 164: the physical layer 164-1, the medium access control (MAC) layer 164-2, and the network layer 164-3. The physical layer 164-1 of the wireless communication system 160 uses traditional methods that usually first estimate the channel state by sending pilot signals from the transmitter 166 to the receiver 180. Then, according to the achieved channel state information (CSI), specific modulation, source encoding, channel encoding, and power control strategies can be selected to realize the optimal utilization of the system 160's resources (e.g., bandwidth and energy budgets). However, the CSI only contains amplitude and phase information of the channel fading rather than the locations, number, and environmental information of the user devices 180, so the truly optimal system performance cannot be realized. In contrast, with the aid of the more comprehensive user information generated by block 104, dynamic modulation, encoding, and power control can be optimally formulated and implemented. For example, for the MIMO beamforming communication system 160, the direction and power of the beams 170 can be scheduled using the knowledge of the user terminal 180's locations and blocking obstacles inferred from the visual data, which cannot be obtained via traditional methods.
[0035] For the MAC layer 164-2, as in cellular wireless networks, receiver-to-transmitter feedback information and cell-to-cell CSI is utilized in the traditional methods to allocate resources and to guarantee the quality of service. Thus, a long time delay may exist when analyzing the feedback and CSI in crowded scenarios in which a large number of user terminals are served by the network 160. By jointly using this information and the density or distribution of the user terminals 180 obtained from the visual data in the serving area of the base station 168, channel resources (including frequency bands, time slots, etc.) can be efficiently reserved and allocated to achieve the optimal overall performance. For example, smart homes have various kinds of terminals such as smartphones, televisions, laptops, and other intelligent home appliances. As such, channel resources can be dynamically scheduled by considering the information obtained from the visual data, such as the number and locations of the user terminals. Unlike traditional handover algorithms that use the measured fluctuation of the received signal's power to estimate the distance between the terminal and base station, the moving information obtained at block 104, which may include the velocity and its variations, can be fully estimated from the visual data to accurately facilitate channel resource allocation in the handover process. This is very desirable for fifth-generation (5G) wireless networks due to the shrinking sizes of the serving zones.
[0036] For the network layer 164-3, taking multi-hop transmission scenarios as an example, traditional routing algorithms mostly run based on the length of the routing path estimated by the pilot and feedback signals, which cannot reflect the actual location changes of the mobile user terminals. By exploiting the topology information associated with the users, obtained from the visual data, novel routing algorithms can be designed to efficiently improve transmission performance, such as the end-to-end delivery delay, packet loss rate, jam rate, and system throughput. In another case, wireless sensor networks have numerous sensors that can be deployed in target areas to monitor, gather, and transmit information about their surrounding environments. For this case, the system topology information extracted from the visual data can be used to design multi-hop transmissions, which are required due to the inherent resource limitations and hardware constraints of the sensors.
[0037] From these examples it is evident that the traditional algorithms adopted by wireless communication systems depend on traditional channel/network state estimation methods to grab the CSI and network state information, which unavoidably suffer from time delays and/or feedback errors, resulting in low efficiency or even wrong decisions. In particular, it is hard or impossible to get accurate CSI or network state information in highly dynamic network scenarios through the traditional methods. Thus, the system 100 shown in Figure 1, by using one or more of the DL-based CV techniques, can accurately and efficiently extract the static and dynamic system information from the recorded visual data, bringing vital benefits to the design and optimization of the wireless communication system 160.
[0038] Applying DL-based CV to wireless communications has two aspects: the datasets that are used for training the system 100, and to what other systems the results of the DL-based CV system are applied, i.e., their applications. With regard to the first aspect, building datasets is a necessary step as the DL system is data-hungry. Some authors [1] proposed a parametric, systematic, scalable dataset framework, called Vision-Wireless (ViWi), for these systems. The authors utilized a DL-based CV framework to build the first-version dataset containing four scenarios with different camera distributions (co-located and distributed) and views (blocked and direct). These scenarios were based on a millimeter wave (mmWave) MIMO wireless communication system. Each scenario contained a set of images captured by the cameras and raw wireless data (signal departure/arrival angles, path gains, and channel impulse responses). Using a MATLAB script, the authors could view the user's location and channel information in each image from the raw wireless data. Later, the same authors built the second-version dataset called ViWi Vision-Aided Millimeter-Wave Beam Tracking (ViWi-BT) [2]. This dataset contains images captured by the co-located cameras and mmWave MIMO beam indices under a predefined codebook. The authors in [3] introduced another dataset, called Raymobtime, which contains ray-tracing, LIDAR, matrix channel, GPS, and image data in mmWave MIMO vehicle-to-infrastructure wireless communication systems. Notably, the ray-tracing data provides path parameters such as received power, time of arrival, angle of departure, angle of arrival, line-of-sight ray status, and ray phase, while the GPS user info data has line-of-sight (LOS) status, channel valid or not information, the number of the TX in the vehicle, and the 3D coordinates. A dataset consisting of depth image frames from recorded videos was also built and can be applied in channel estimation tasks. [0039] With regard to the potential applications of the DL-based CV system 100, there are a couple of possibilities in the wireless communication environment. A framework to implement beam selection in mmWave communication systems by leveraging environmental information was presented in [4]. The authors used the images with different perspectives captured by one camera to construct a three-dimensional (3D) scene and generate corresponding point cloud data. They built a model based on a 3D CNN to learn the wireless channel from the point cloud data and predict the optimal beam. Based on the first-version ViWi dataset, the authors in [5] proposed a modified ResNet18 model to conduct beam and blockage prediction, based on the images and channel information. Based on the second-version ViWi-BT dataset, the authors in [2] provided a baseline method based on Gated Recurrent Units (GRUs) without the images, only the beam indices. The authors in [2] believe that they can achieve better performance if they leverage both kinds of data. Based on the Raymobtime dataset, CNN and deep reinforcement learning (DRL) were utilized in [3] to select a proper pair of beams for vehicles, with images generated from GPS location data in vehicle-to-infrastructure scenarios. The authors also compared DL-based methods with other traditional machine learning methods such as SVM, AdaBoost, decision tree, and random forest. The results showed that the DL-based method has the best performance.
In one application, two CNNs were proposed to conduct line-of-sight decision and beam selection by using LIDAR point cloud data in the Raymobtime dataset. In another application, a neural network containing CNNs and an RNN-based recurrent prediction network was proposed to predict dynamic link blockages using red, green, blue (RGB) images and beamforming vectors provided by the extended ViWi-BT dataset. In [6], the authors developed a CNN-based framework, called VVD, to estimate the wireless communication channels only from the depth images in mmWave systems. In [7], a framework consisting of a CNN and a convolutional long short-term memory (LSTM, convLSTM) network was presented to proactively predict the received power through depth images in mmWave networks; it exhibited the highest accuracy compared with the random forest algorithm and a CNN-based method. In [8], a proactive handover management framework was proposed to make handover decisions by using camera images and DRL. In [9], a multimodal split learning method based on convLSTM networks was presented to predict mmWave received power through camera images and radio frequency signals while considering communication efficiency and privacy protection.
[0040] In this embodiment, MIMO and beamforming assisted by the DL-based CV system 100 are discussed. MmWave communication is a promising technique in the fifth-generation communication system due to its broad available bandwidth and ultra-high data-transmitting rate. MIMO and beamforming are widely used in mmWave communication systems and are implemented with a large antenna array to achieve the required high power gain and directionality. The classic beamforming and beam tracking algorithms suffer from a common disadvantage: their complexity increases dramatically with the number of antennas, resulting in substantial computational overhead. The DL-based CV system addresses this overhead issue.
[0041] As defined in [2] and shown in Figure 2, a typical scenario encountered by a wireless communication system 160 deployed in a city is the presence of high-rises 202 and 204, which are located on the same street 206 separated by a certain distance, for example, 60 meters. The base station (BS) 168 forms a MIMO beam 170-I for each target user 180-I (here I takes the value 2, but it may take any whole value) moving along the street 206. Therefore, the beam 170-I's direction must be dynamically adjusted by the BS to catch the corresponding target mobile user 180-I. A target user terminal 180-2 may be blocked at some moments, for example, at t8 in Figure 2, by an obstacle 182 (a truck in this case, but other obstacles may block the direct beam 170-2) and then the beam cannot directly reach the target user 180-2. However, as shown in Figure 2, it is possible to generate a reflected beam 170-2', which reflects, for example, from the building 204 (but it is also possible to reflect from other objects, such as other vehicles), and arrives at the desired target 180-2. The DL-based CV system 100 is implemented herein to detect that the beam 170-2 is blocked, to determine/predict the reflected beam 170-2', and to help the communication system 160 implement the new beam.
[0042] For this purpose, three cameras 162 were installed on the tower 169 of the BS 168 to capture RGB images 210-I (where I takes values between 1 and n=8 in this embodiment, but other smaller or larger values may be used) of the street where the target user terminal 180-2 is present, to assist the beamforming process. It is noted that fewer or more cameras may be used for this purpose. Also, the cameras may capture any type of images, for example, color images, black and white images, infrared images, etc. With the captured data, the problem that needs to be addressed is how to utilize the n=8 pairs of previously-observed consecutive beams and corresponding images to predict the future one, three, and five beams with the DL network. Notably, these beams are represented as beam indices under the same predefined codebook. Note that any number of future beams may be predicted. [0043] A sequence containing the n=8 pairs of (i) previously-observed images 210-I and (ii) corresponding beam indices bu[i] for the uth user at the time instances ti is given as:
$$S_u[t] = \{(X_u[t-7], b_u[t-7]), (X_u[t-6], b_u[t-6]), \ldots, (X_u[t], b_u[t])\},$$

where $X_u[i]$ is the RGB image taken at the ith time instance and $b_u[i]$ is the corresponding beam index generated by the wireless communication system.

[0044] A prediction function of the DL network is selected to be $f_{\Theta}(\cdot)$, and a predicted beam index at the time instance $t + m$ is then

$$\hat{b}_u[t+m] = f_{\Theta}(S_u[t]),$$

with m = 1, ..., 5. Note that fewer or more than 5 beam indices may be predicted. For this scenario, the prediction function $f_{\Theta}(\cdot)$ takes in the sequence $S_u[t]$ (which has n=8 terms in this embodiment, but fewer or more terms may be used) and outputs a predicted sequence $\hat{B}_u[t] = \{\hat{b}_u[t+1], \ldots, \hat{b}_u[t+5]\}$. The symbol $\Theta$ in the prediction function represents the set of parameters of the DL model, which is obtained by training the model with a training set. The training set includes labelled sequences, i.e.,

$$\mathcal{D} = \{(S_u[t], G_u[t])\}_{u=1}^{U},$$

where each pair in the set includes an observed sequence $S_u[t]$ and five groundtruth future beam indices $G_u[t] = \{g_u[t+1], \ldots, g_u[t+5]\}$.

[0045] The goal of this embodiment is to get the prediction function $f_{\Theta}^{*}(\cdot)$ which can maximize the joint success probability of all data samples in the dataset $\mathcal{D}$. The objective function used to maximize the joint success probability is expressed in this embodiment as:

$$f_{\Theta}^{*} = \arg\max_{f_{\Theta}} \prod_{u=1}^{U} \mathbb{P}\!\left(\hat{B}_u[t] = G_u[t] \mid S_u[t]\right),$$

where each success probability only relies on its observed sequence $S_u[t]$. Other objective functions may be used.
[0046] For calculating the predicted sequence $\hat{B}_u[t]$, the DL network 300 illustrated in Figure 3 is used. Those skilled in the art would understand that this framework is just an example and other configurations may be used for the DL network with similar results. The DL network 300 shown in Figure 3, which essentially is the underlying structure for the blocks 104, 106, 108, and 112 in Figure 1, includes a residual neural network (ResNet) block 302 and a ResNext block 304, each of which is configured to receive the images Xu[i] collected by the cameras 162. The figure indicates that only n=8 images are used in this embodiment. However, as noted above, more or fewer images may be used if the DL network 300 is adjusted accordingly. The DL network 300 further includes a feature-fusion module (FFM) 306 and a predictive network 308. As the internal structure of the ResNet block, the ResNext block, and the FFM module are known, their description is omitted as the reader can obtain this information from, for example, K. He et al., "Deep residual learning for image recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Las Vegas, NV, USA, Jun. 2016, pp. 770-778; K. Hara et al., "Can spatiotemporal 3D CNNs retrace the history of 2D CNNs and ImageNet?," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Salt Lake City, UT, USA, Jun. 2018, pp. 6546-6555; and B. Wang et al., "Controllable video captioning with POS sequence guidance based on gated fusion network," in Proc. IEEE Int. Conf. Comput. Vis., Seoul, South Korea, Oct./Nov. 2019, pp. 2641-2650.
[0047] In this embodiment, the ResNet block 302 includes several residual blocks 400-J, where J is a whole number, as illustrated in Figure 4. Each block 400-J includes two or more convolutional layers and superimposes its input onto its output through an identity mapping. It can efficiently address the vanishing gradient issue caused by the increasing number of convolutional layers. If a specific number of such blocks are concatenated, as depicted in Figure 4, the ResNet block 302 can be built with as many as 152 layers.
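As a purely illustrative sketch of such a residual block (the class name and channel sizes are assumptions, not the actual blocks 400-J), the identity-mapping structure can be written in PyTorch as:

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Two convolutional layers whose input is added back onto the output
    through an identity mapping, mitigating vanishing gradients."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        identity = x                          # identity mapping
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + identity)      # superimpose input onto output
```

Stacking many such blocks yields the deep ResNet structure referred to above.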
[0048] Figure 5 presents the structure of a ResNext block 500, which is an improved version of the ResNet block 302 that adds a 'next' dimension, also called "cardinality." The ResNext block 500 sums the outputs of K parallel convolutional layer paths 502-K, which share the same topology, and inherits the residual structure of the combination. As K diversities are achieved by the K paths, this block can focus on more than one specific feature representation of the images. For the 3D ResNext block 304, a similar structure can be observed but with 3D convolutional layers instead of two-dimensional (2D) ones. The 3D convolutional layer is designed to capture spatiotemporal 3D features from raw videos.
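A minimal sketch of the cardinality idea follows, assuming the K parallel paths are realized compactly as a grouped convolution (a common equivalent formulation); the class name and sizes are illustrative assumptions:

```python
import torch.nn as nn

class ResNextBlock(nn.Module):
    """Residual block with cardinality K: K parallel bottleneck paths,
    implemented here as a grouped convolution, summed with the input."""
    def __init__(self, channels: int, cardinality: int = 32, bottleneck: int = 128):
        super().__init__()
        self.reduce = nn.Conv2d(channels, bottleneck, kernel_size=1, bias=False)
        self.grouped = nn.Conv2d(bottleneck, bottleneck, kernel_size=3,
                                 padding=1, groups=cardinality, bias=False)  # K parallel paths
        self.expand = nn.Conv2d(bottleneck, channels, kernel_size=1, bias=False)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.reduce(x))
        out = self.relu(self.grouped(out))
        out = self.expand(out)
        return self.relu(out + x)             # residual structure is inherited
```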
[0049] The ResNet and 3D ResNext blocks have been widely used as feature extractors for their powerful feature-representation abilities. If they are used in a DL network directly as in Figure 3, however, the training time will become extremely long, and many computational resources will be occupied due to the large number of layers. Therefore, it is customary to apply a ResNet pre-trained on the ImageNet dataset to extract visual features from the images and to apply a 3D ResNext pre-trained on the Kinetics dataset to extract spatiotemporal features from videos. These features are then fed to the DL network as inputs.
[0050] The FFM module 306 includes, as shown in Figure 3, two long short-term memory (LSTM) networks 310, one of which is configured to receive the data from the ResNet module 302 and the other one configured to receive the data from the 3D ResNext module 304. The LSTM network 310, see [10], is configured for tasks that contain time-series data, such as prediction, speech recognition, text generation, etc. Hence, it is a suitable candidate for the predictive network 308. The LSTM network 310, which is shown in more detail in Figures 6 and 7, includes several LSTM cells 800-L, as illustrated in Figure 8, where L is a whole number equal to or larger than the number of input images n. For example, in this embodiment, because there are n=8 input images and it is desired to predict 5 future beam indices, there are 12 LSTM cells for the 1D LSTM network 310/600 of Figure 6. More LSTM cells are used for the 2D network 310/700 shown in Figure 7.
[0051] The following data is used as the input for the LSTM cell 800-L in Figure 8: an event 802 (which describes the current state), previous long-term memory 804 (which describes the cell state), and previous short-term memory 806 (which describes the hidden state). These inputs of the LSTM cell 800-L are provided to the learn gate 810, forget gate 812, remember gate 814, and output gate 816 and are employed to explore the information from the inputs. The LSTM cell 800-L outputs new long-term memory data 820 and short-term memory data 822, the latter of which is also regarded as a prediction. The data 820 and 822 are then provided to the next LSTM cell as the initial state 610.
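Purely as an illustration of this data flow, the standard PyTorch LSTMCell can stand in for the cell 800-L; the tensor names below (event, h_short, c_long) and the sizes are assumptions chosen for readability, not part of the described design:

```python
import torch
import torch.nn as nn

cell = nn.LSTMCell(input_size=512, hidden_size=512)   # sizes are illustrative

event = torch.randn(1, 512)      # current input (the "event" 802)
h_short = torch.zeros(1, 512)    # previous short-term memory / hidden state 806
c_long = torch.zeros(1, 512)     # previous long-term memory / cell state 804

# One step: the gates inside the cell combine the event with both memories and
# emit updated short-term (also used as the prediction) and long-term memories.
h_short, c_long = cell(event, (h_short, c_long))
```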
[0052] When the LSTM cell 800-L is recursively utilized in series in a 1D array form, as shown in Figure 6, the 1D LSTM network 600 is formed. At each moment, the cell and the hidden states of the previous moment are used to generate the outputs of the current moment, as illustrated in Figure 8. A 2D LSTM network 310/700 can be realized when the LSTM cell 800-L is recursively utilized in a 2D mesh form as shown in Figure 7. For the 2D mesh form, each LSTM cell utilizes the hidden and cell states from two neighboring cells 800-(L-1) and 800-(L-2), which are in the left and below positions in the mesh in Figure 7, and its initial state 610 is delivered to its neighboring cells in the right and top positions. For this case, the number of predictions is equal to the number of rows of the mesh. Note that for the network 600 shown in Figure 6, each cell receives an input from a previous cell (i.e., the state 610 of the previous cell) and the last four cells, i.e., cells 800-9 to 800-12, also receive the output of the previous cell, e.g., cell 800-9 receives the output from cell 800-8. The outputs from the cells are labeled oj, with j being a positive integer that is one more than i, i.e., j = i + 1.
[0053] The FFM module 306 in Figure 3 includes, in addition to the two LSTM networks 310, a cross-gating block 312. The features 303 from the ResNet block 302 and the features 305 from the 3D ResNext block 304 are aggregated by the LSTM networks 310 and then high-level merged features 320 are obtained. The cross gating block 312 can make full use of the related semantic information between these two kinds of features 303 and 305 by multiplication 314 and summation 316 operations. Then, the merged features 320 can be obtained through a linear transformation 318.
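One possible reading of the cross-gating operation is sketched below; the sigmoid gating, the common dimension, and the class name CrossGating are assumptions, and the exact gating used in the block 312 may differ:

```python
import torch
import torch.nn as nn

class CrossGating(nn.Module):
    """Gate each feature stream with the other via element-wise multiplication,
    sum the gated streams, and apply a linear transformation to merge them."""
    def __init__(self, dim: int):
        super().__init__()
        self.gate_visual = nn.Linear(dim, dim)
        self.gate_motion = nn.Linear(dim, dim)
        self.merge = nn.Linear(dim, dim)

    def forward(self, visual, motion):
        gated_visual = visual * torch.sigmoid(self.gate_motion(motion))   # multiplication 314
        gated_motion = motion * torch.sigmoid(self.gate_visual(visual))
        fused = gated_visual + gated_motion                               # summation 316
        return self.merge(fused)                                          # linear transform 318
```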
[0054] For the DL network 300, n=8 consecutive images T1 to T8 are inputted and utilized. As the sequence of images is equivalent to a video clip, it contains motion information, which is helpful for the beam prediction. Combined with the visual information from each image, e.g., the locations of the various user terminals, targets, and obstacles, the motion of these elements and the blockage information can be extracted from the images T1 to T8. The pre-trained 3D ResNext with 101 layers (3D ResNext101) is adopted to extract the motion features 305 (e.g., speed, direction, acceleration, etc.) and the pre-trained ResNet with 152 layers (ResNet152) is used to extract the visual features 303 (e.g., locations of the user terminals, targets, obstacles, heights of the targets, etc.). These features 303 and 305 are then merged as a vector 320 through the FFM module 306 and sent to the predictive network 308.
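A minimal sketch of how the pre-trained visual-feature extractor could be invoked is given below; the Kinetics-pretrained 3D ResNeXt-101 is assumed to come from a separate repository and is therefore only indicated in a comment, and the input resolution is an assumption:

```python
import torch
import torchvision.models as models

# Pre-trained 2D extractor for per-image visual features (2048-dim before the classifier).
resnet152 = models.resnet152(weights=models.ResNet152_Weights.DEFAULT)
resnet152.fc = torch.nn.Identity()          # drop the ImageNet classifier head
resnet152.eval()

with torch.no_grad():
    frames = torch.randn(8, 3, 224, 224)    # the n=8 observed images (illustrative size)
    visual_features = resnet152(frames)     # shape (8, 2048)

# A Kinetics-pretrained 3D ResNeXt-101 would analogously map the 8-frame clip
# to a single motion-feature vector (8192-dimensional in this embodiment).
```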
[0055] As illustrated in Figures 6 and 7, there are three kinds of inputs for the predictive networks 600 and 700, namely, the initial state 610, the embedded beam vectors bi, and the merged feature vector 320, where "i" varies up to and including n=8 in this embodiment. The initial state 610 is set as a vector having only zeros for the first cell. However, for the next cells in the network, the initial state 610 includes the long-term memory 804 and the short-term memory 806. The long-term memory includes information of previous time slots, like beams, locations, and speeds. The short-term memory includes the beam information of the current time slot. The embedded vectors bi represent a mapping of a constant (beam index) to a vector, and they are indicative of the direction of each beam generated by the base station. Thus, the embedded vector is utilized to represent the beam index. In each LSTM cell 800-L, the embedded beam vector bi and the merged feature vector 320 are first transformed to the same shape and then summed up as the 'event' 802. According to (1) the event 802 and (2) the short- and long-term memories 806 and 804, respectively, obtained from the previous LSTM cell 800-(L-1), the current LSTM cell 800-L predicts a future output vector oj, whose index of the maximum element is the predicted beam index. Notably, all the LSTM cells share the same merged features 320.
[0056] Based on the 1D and 2D LSTM networks 600 and 700 illustrated in Figures 6 and 7, three methods are proposed and explained now. A first method uses the 1D LSTM network 600 of Figure 6. The LSTM cell 800-L is recursively used 12 times in this method. The cell at the kth moment is denoted as the 'kth LSTM cell'. As shown in Figure 9, during the training process, the flow of this first method is as follows. In step 900, n=8 consecutive images T1 to T8 are fed to the pre-trained ResNet152 and 3D ResNext101 blocks and then visual features 303 and motion features 305 are extracted. The features 303 and 305 are merged through the FFM module 306 in step 902 and the output vector 320 is generated. In step 904, the merged features and the previous beam index are fed to each LSTM cell 800-L as an input, as shown in Figure 6. In step 906, the embedded vectors bi of the first 8 beam indices go through the first to the eighth LSTM cells to update the hidden states. The last four cells do not receive as input the embedded vectors that correspond to actual beams. All the 12 LSTM cells generate 12 corresponding output vectors oj. Note that the embedded vectors bi for the first 8 cells are obtained from the communication network 160 (i.e., they represent actual beams sent to the target user terminal) while the last 4 cells receive embedded vectors that are calculated from the corresponding output vectors oj of the previous corresponding cell, i.e., the embedded vector b9 fed at cell 800-9 is calculated from the output o9 of the previous cell 800-8, as indicated by arrow 620 in Figure 6. The 12 output vectors are then used in step 908 to calculate the training loss 1000 with the ground truth and train the network 100, as schematically illustrated in Figure 10. Any loss function 1002 may be used.
[0057] During the actual implementation of this network for actual processing, as only the first eight beam indices bi and their corresponding images are available, the step 906 is not applicable and instead the following two steps are implemented: the embedded vectors bi of the first eight beam indices go through the first to eighth LSTM cells and update the hidden states, as shown in Figure 11, and then the eighth to twelfth LSTM cells are used to predict the future beam indices, i.e., I9 to I13, which are obtained by acquiring the indices of the maximum element in these output vectors (i.e., using corresponding Argmax layers 1110). Each cell of the 9th to 12th cells in Figure 11 is fed with the hidden state and also with the embedded beam index predicted by the previous LSTM cell, by using the embedding layer 1120. The step 908 is also not performed during the actual implementation of the algorithm. [0058] The second method uses a modified 1D LSTM network. For the first method discussed above with regard to Figure 9, the training and testing procedures were different. In other words, the first method essentially aims to predict the next beam index as all the first 12 beam indices are used as inputs during the training process. During the application process, among the eighth to twelfth predicting beam indices, the previous one's correct prediction is important for the next prediction. To make the training and testing processes consistent with each other, the second method modifies the first method so that the output vector of each of the last five LSTM cells 800-8 to 800-12 undergoes a linear transformation module and is fed to the next cell as the embedded input bi. In this way, only the first eight beam indices are used as inputs, and the training and testing steps can be the same.
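The deployment flow of the first method (warm-up on the observed beams, then autoregressive prediction through an argmax and an embedding feedback) could be sketched as follows; the class name BeamPredictor1D and all module choices and sizes are illustrative assumptions rather than the actual implementation:

```python
import torch
import torch.nn as nn

class BeamPredictor1D(nn.Module):
    """Minimal 1D-LSTM beam-predictor sketch: each step sums the embedded
    previous beam index with the merged visual/motion features (the 'event'),
    updates the LSTM state, and reads out a vector over the beam codebook."""
    def __init__(self, num_beams=129, feat_dim=2048 + 8192, embed_dim=463, hidden=512):
        super().__init__()
        self.embed = nn.Embedding(num_beams, embed_dim)     # beam index -> embedded vector
        self.proj_feat = nn.Linear(feat_dim, embed_dim)     # merged features 320 -> same shape
        self.cell = nn.LSTMCell(embed_dim, hidden)
        self.readout = nn.Linear(hidden, num_beams)         # output vector o_j

    def forward(self, merged, observed_beams, horizon=5):
        h = torch.zeros(1, self.cell.hidden_size)
        c = torch.zeros(1, self.cell.hidden_size)
        feat = self.proj_feat(merged)                        # shared by all cells
        for b in observed_beams[:-1]:                        # cells 1..7: update hidden states
            event = self.embed(torch.tensor([b])) + feat
            h, c = self.cell(event, (h, c))
        preds, last = [], observed_beams[-1]
        for _ in range(horizon):                             # cells 8..12: predict o9..o13
            event = self.embed(torch.tensor([last])) + feat
            h, c = self.cell(event, (h, c))
            last = int(self.readout(h).argmax(dim=-1))       # argmax layer -> beam index
            preds.append(last)
        return preds

# Example usage with random stand-in features and eight observed beam indices
predictor = BeamPredictor1D()
merged_features = torch.randn(1, 2048 + 8192)
future_beams = predictor(merged_features, [3, 3, 4, 5, 5, 6, 7, 7])
```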
[0059] A third method uses the 2D LSTM network 700 shown in Figure 7 as the predictive network. In this method, the initial state 610, the merged feature vector 320, and the embedded vectors of the first eight beam indices b1 to b8 are input into the LSTM network 700, and then the network generates five output vectors o9 to o13. The training process is the same as the testing one.
[0060] The three methods discussed above were tested with the ViWi-BT dataset. All the experiments were conducted in the framework of PyTorch on one NVIDIA V100 GPU. The ViWi-BT dataset contains a training set with 281,100 samples, a validation set with 120,468 samples, and a test set with 10,000 samples. There are 13 pairs of consecutive beam indices and corresponding images of street views in each sample of the training and validation sets. Furthermore, the first eight pairs correspond to the observed beams for the target user terminal and the sequence of the images in which the target user terminal appears. The last five pairs are groundtruth data containing the future beams and images of the same user. In this experiment, the first eight pairs serve as the inputs of the DL network 100 to generate the predicted future five beam indices, which are compared with the groundtruth ones.
[0061] First, the pre-trained ResNet152 and 3D ResNext101 blocks were used to extract 2048-dimensional visual and 8192-dimensional motion features from the first eight images of each sample. The merged features 320 are embedded as a 463-dimensional vector and fed to the predictive LSTM network 308. Each LSTM cell 800-L has a 512-dimensional hidden size and produces a 129-dimensional output vector oj. The training process discussed above with regard to Figure 9 is implemented to train the proposed network. During the training stage, the DL network 300 is optimized by the Adam optimizer. The learning rate is set as 4 × 10^-4 at first and reduced by half every eight epochs. The batch size is set as 256. The cross-entropy loss is utilized for the loss function.
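A sketch of this training configuration is shown below, with a stand-in module in place of the full network; the real predictive network, data loader, and forward pass are not reproduced here and are only indicated in comments:

```python
import torch
import torch.nn as nn

model = nn.LSTM(input_size=463, hidden_size=512)   # stand-in for the predictive network
optimizer = torch.optim.Adam(model.parameters(), lr=4e-4)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=8, gamma=0.5)  # halve every 8 epochs
criterion = nn.CrossEntropyLoss()

# for epoch in range(num_epochs):
#     for images, beams, targets in loader:        # batch size 256 in this embodiment
#         logits = forward_pass(images, beams)     # hypothetical: (batch, 5, 129) future-beam logits
#         loss = criterion(logits.reshape(-1, 129), targets.reshape(-1))
#         optimizer.zero_grad(); loss.backward(); optimizer.step()
#     scheduler.step()
```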
[0062] Following the evaluation in [2], the performances of the proposed three methods are evaluated on the validation set with the same metrics, which are the top-1 accuracy and exponential decay score. As defined in [2], the top-1 accuracy of m future beams is expressed as:
$$\mathrm{Acc}_{\text{top-1}}(m) = \frac{1}{M}\sum_{i=1}^{M} \mathbb{1}\{\hat{\mathbf{b}}_i = \mathbf{g}_i\},$$

where M is the number of samples in the validation set, $\mathbb{1}\{\cdot\}$ is the indicator function, and $\hat{\mathbf{b}}_i$ and $\mathbf{g}_i$ represent the predicted and groundtruth beam index vectors of the ith sample, each with length m.
[0063] The exponential decay score of n future beams is given as:
$$\mathrm{EDS}(n) = \frac{1}{M}\sum_{i=1}^{M} \exp\!\left(-\frac{\lVert \hat{\mathbf{b}}_i - \mathbf{g}_i \rVert_1}{\sigma\, n}\right),$$
where σ = 0.5 is a penalization parameter.
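A small sketch of how these two metrics could be computed, assuming the exponential decay score takes the L1-error form reconstructed above:

```python
import torch

def top1_accuracy(pred: torch.Tensor, truth: torch.Tensor) -> float:
    """Fraction of samples whose m predicted beams all match the groundtruth.
    pred, truth: integer tensors of shape (M, m)."""
    return (pred == truth).all(dim=1).float().mean().item()

def exponential_decay_score(pred: torch.Tensor, truth: torch.Tensor, sigma: float = 0.5) -> float:
    """Score that decays exponentially with the L1 error between predicted and
    groundtruth beam indices, averaged over the M validation samples."""
    n = pred.shape[1]
    err = (pred - truth).abs().sum(dim=1).float()
    return torch.exp(-err / (sigma * n)).mean().item()

# Illustrative usage with made-up indices
pred = torch.tensor([[10, 11, 12], [20, 21, 22]])
truth = torch.tensor([[10, 11, 12], [20, 21, 23]])
print(top1_accuracy(pred, truth), exponential_decay_score(pred, truth))
```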
[0064] The table in Figures 12A and 12B lists the results of these methods, in which the baseline method in [2] is considered for comparison purposes. In the baseline method, the authors simply leveraged the beam-index data and ignored the image data. From the top-1 accuracy, one can see that the proposed novel method with the 1D LSTM network outperforms the baseline method in [2]. The method with a modified 1D LSTM network is better than the baseline method on '1 future beam' and '3 future beams'. The method with only the 2D LSTM network performs better than the baseline method on '1 future beam'. For the exponential decay scores, the designed methods with the 1D LSTM network and modified 1D LSTM network absolutely outperformed the baseline method. The method with the 2D LSTM network is better than the baseline on '1 future beam' and '3 future beams,' but a little worse on '5 future beams'.
[0065] The proposed novel methods outperform the baseline method on predicting '1 future beam' because the location, blockage, and speed information of the target user terminal is extracted from the images and represented as motion and visual features to assist the prediction, and advanced LSTM networks are leveraged as the predictive networks. Among the three proposed novel methods, the method with the 1D LSTM network shows the best beam prediction performance in the target mobile scenarios. Compared with this method, the other two exhibit extra linear transformation modules or more LSTM cells in their predictive networks and need to be trained with more data. Therefore, the performance degradation occurring on the predictions of '3 future beams' and '5 future beams' is caused by the small size of the training dataset.
[0066] The computational complexity here is measured by the running time of the '1 future beam' prediction, which exhibits the best performance and is more likely to be implemented in a practical wireless communication system. The running time of the novel methods consists of the execution time of feature extraction, the FFM, and the predictive network. It takes 0.42 seconds for the pre-trained 3D ResNext101 and ResNet152 to extract features from each set of eight images. The method with the 2D LSTM network exhibits the longest average running time due to its more complex structure shown in Figure 7. The baseline method runs for the shortest time as it utilizes simple GRUs as the predictive network and abandons the image data. The method with the 1D LSTM network shows the best predictive performance, but a moderate prediction time, 0.016 seconds.
[0067] The feature extraction takes a little longer, which will cause latency issues, but these issues can be mitigated by employing more efficient CNNs. In a practical application, only one new image is captured at each time instance, and the previously extracted spatiotemporal features can be merged with the new image to reduce the latency. Similarly, for the 2D feature extraction, only the new image needs to be processed at each time instance. It is known in the art that Darknet-19 can achieve a speed of 171 frames per second (5.85 ms per image) to extract 2D features on an NVIDIA Titan X GPU. The 2D and 3D feature extractions can be conducted simultaneously, which means that the longer of the two determines the whole feature extraction time. By jointly using TSN and Darknet-19, the running time of the method with the 1D LSTM network can be reduced to less than 15.5+16 = 31.5 ms on a V100 GPU, which is more powerful than the P100 and Titan X.
[0068] For practical scenarios, the RGB images and their corresponding beam indices can be obtained from cameras installed on the BS and the classic beamforming algorithm, respectively. After obtaining sufficient data for the training set, the proposed new network will be pre-trained on these data and then run in the processors of the BS for beamforming. At the beginning of the serving time, the first eight beam indices can be estimated by the classic beamforming algorithm. Then, the eight pairs of images and beam indices are sent to the processors for future beam predictions. Notably, after the first eight beams, the subsequent beams will be predicted by using previously-obtained images and beam indices. These predicted beam indices and their corresponding images can be added to the training set to enlarge the dataset and improve the performance.
[0069] Many state-of-the-art DL techniques have been proven efficient and powerful in CV, such as reinforcement learning, encoder-decoder architectures, generative adversarial networks (GANs), the Transformer, graph convolutional networks (GCNs), etc. Reinforcement learning has been widely applied in tackling optimization problems in wireless communications. GCNs can be leveraged to address network-related issues, and the encoder-decoder architecture is widely used in semantic segmentation and sequence-to-sequence tasks. The GAN is a powerful network for learning the statistics of training data and has been widely used to improve the performance of other DL networks in CV [11]. The Transformer, built on attention mechanisms, is a kind of encoder-decoder architecture that can handle unordered sequences of data. Much CV research has shown that if these techniques are jointly applied to make full use of the visual data, better results can be obtained. [0070] Thus, a single proper CV technique or an adequate combination of several CV techniques can be used with the novel methods to handle a specific problem in wireless systems. The system shown in Figure 1 used a combination of ResNet, 3D ResNext, and an LSTM network to achieve the required performance. However, other efficient CV techniques can be used.
[0071] As many kinds of cameras and LIDARs operate in real life, an enormous amount of visual data can be obtained through them, from which more accurate motion and position information with regard to the user terminals can be recognized, analyzed, and extracted and then exploited to design and optimize wireless communications. For example, as shown in Figure 13, for a cellular network, visual data 102 obtained at the BS 168 may contain the locations, number, and motion information 303, 305 of the user terminals that operate in the open area associated with the BS and associated objects. This information is extracted by the block 104 and can be used by the BS to adjust its transmitting power 1310 and beam direction 1312 to save power consumption and reduce interference, and also to aid with the handover 1314 to or from another BS. [0072] Figure 14 presents a real-life example: the motion information of the user terminals 180-I in the coverage area 1410 of a BS 168 can be utilized to forecast the future positions of these terminals and judge whether/when a terminal (e.g., terminal 180-2) goes out of or a new one (e.g., 180-3) comes into its serving area. Then, the transmit power and beam direction for these terminals can be accurately assigned for the users who still stay in the coverage area, and channel resource allocation can be set up for the handover process to improve the utilization efficiency of the system resources.
[0073] In another application, the novel methods discussed above may be implemented for a Vehicle-to-Everything Communication system. For this scenario, visual data captured by one vehicle can reveal its environment, such as traffic conditions, which can be used to set up links with neighboring terminals, access points, and vehicles. Therefore, traffic schedules and jam/accident alarms can be conducted for improved road safety, traffic efficiency, and energy savings. As depicted in Figure 15, the BS 168 is located on the side of the road 206 and utilizes the visual information obtained from the camera 162 to estimate a distance 1510 to a vehicle terminal 180-I, at various times, and adjusts its transmit power 1520 based on the actual distance between the BS and the vehicle terminal for power saving and interference reduction purposes. Moreover, the cameras 1530, which are located on the vehicles 180-I, can capture images or videos that detect a traffic jam or accident 1532, and the vehicles then forward the observed traffic information 1534 to a traffic control center (not shown), via the BS or other means.
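As a purely illustrative sketch of such distance-based power adjustment (the free-space path-loss model, carrier frequency, and target received power below are assumptions, not part of the described system):

```python
import math

def adjust_tx_power_dbm(distance_m: float, freq_ghz: float = 28.0,
                        target_rx_dbm: float = -70.0) -> float:
    """Pick a transmit power that compensates free-space path loss so the
    receiver sees roughly target_rx_dbm at the camera-estimated distance."""
    # Free-space path loss (dB) = 20*log10(d_km) + 20*log10(f_MHz) + 32.44
    fspl_db = 20 * math.log10(distance_m / 1000.0) + 20 * math.log10(freq_ghz * 1000.0) + 32.44
    return target_rx_dbm + fspl_db

# Example: vehicle estimated at 120 m from the BS via the camera-based distance 1510
print(round(adjust_tx_power_dbm(120.0), 1), "dBm")
```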
[0074] In yet another application, the novel methods discussed above may be used for unmanned aerial vehicle (UAV)-Ground communications. When a UAV serves as an aerial BS, visual data captured by the UAV can be used to identify the locations and distribution of ground terminals, which can be utilized in power allocation, route/trajectory planning, etc. Moreover, when a ground BS 168 communicates with several UAVs, visual data captured by the ground terminal can be used to define the serving range, allocate the channels/power, and so forth. For example, Figure 16 illustrates a system 1600 in which a group of UAVs 1602-I communicates with a set of ground terminals 168. The head UAV 1602 first takes an image of the whole area 1610 and detects all the terminals 168. Then the serving area 1610 is divided into several subareas 1612-I. Each UAV 1602-I serves one specific subarea 1612-I and designs a route schedule 1614-I according to the location information of the ground terminals obtained from the corresponding images.
[0075] In another application, visual data captured by satellites or airborne crafts can be applied to recognize and analyze the user’s distribution and schedule power budget/serving ranges to achieve optimal energy efficiency, thus achieving a “smart city.” In yet another application, intelligent reflecting surfaces (IRSs) may be used with the novel methods discussed above. More specifically, implementing channel estimation and achieving network state information at an IRS is impossible because there is no comparable calculation capacity and no radio frequency (RF) signal transmitting or receiving capabilities at the IRS. However, DL-based CV can offer useful information to compensate for this gap. Thus, a proper control matrix can be optimally designed to accurately reflect the incident signals to the target destination by utilizing the visual data captured by the camera 1710 installed on the IRS 1702. The visual data includes the locations, distances, and the number of terminals 180-I, as shown in Figure 17.
[0076] A method for calculating beam indices in a wireless communication system, based on the deep-learning network shown in the previous figures is now discussed with regard to Figure 18. The method includes a step 1800 of receiving n images of a target user terminal and objects around the target user terminal, where n is a positive integer, a step 1802 of extracting visual features and motion features associated with the target user terminal and the objects around the target user terminal, a step 1804 of generating, in a given cell of a deep-learning network, an output vector oj based on (1) merged features, obtained from the visual features and the motion features,
(2) an initial state received from a previous cell of the deep-learning network, and (3) a beam index bi, wherein j = i + 1 and i and j are positive integers, and a step 1806 of calculating a future beam index lm for the target user terminal for a future time, wherein m is a positive integer. The future beam index lm is calculated to avoid obstacles between an antenna of the wireless communication system and the target user terminal at the future time.
[0077] The method may further include a step of applying an argmax layer to the output vector oj to calculate the future beam index lm, only for j equal to or larger than n, and/or a step of directing a future beam of the wireless communication network along a direction embedded into the future beam index lm. In one application, n = 8, i can take a value between 1 and 8, m can take a value between 9 and 13, and j takes a value between 2 and 13. In this or another application, i takes a value up to and including n, and m takes a value between n+1 and any positive integer larger than n+1.
[0078] The visual features include at least one of a position of the target user terminal, positions of other user terminals, a number of the other user terminals, a position of the objects around the target user terminal, and a height of the objects around the target user terminal. The motion features include at least one of a speed or acceleration or direction of the target user terminal, the other user terminals, and the objects around the target user terminal. The deep-learning network includes plural given recurrent neural network cells, the first n-1 given recurrent neural network cells being configured to update a hidden state of the deep-learning network and the remaining given recurrent neural network cells being configured to generate the future beam indices lm. In one application, the deep-learning network includes plural given recurrent neural network cells forming a one-dimensional network. In another application, the deep-learning network includes plural given recurrent neural network cells forming a two-dimensional network. The method may further include a step of calculating a power level for the future beam index lm based on the merged features. [0079] The above-discussed procedures and methods may be implemented in a computing device as illustrated in Figure 19. Hardware, firmware, software or a combination thereof may be used to perform the various steps and operations described herein. Computing device 1900 suitable for performing the activities described in the exemplary embodiments may include a server 1901. Such a server 1901 may include a central processor (CPU) 1902 coupled to a random access memory (RAM) 1904 and to a read-only memory (ROM) 1906. ROM 1906 may also be other types of storage media to store programs, such as programmable ROM (PROM), erasable PROM (EPROM), etc. Processor 1902 may communicate with other internal and external components through input/output (I/O) circuitry 1908 and bussing 1910 to provide control signals and the like. Processor 1902 carries out a variety of functions as are known in the art, as dictated by software and/or firmware instructions. The server 1901 may be part of the communication network 160 or may just interact with such network.
[0080] Server 1901 may also include one or more data storage devices, including hard drives 1912, CD-ROM drives 1914 and other hardware capable of reading and/or storing information, such as DVD, etc. In one embodiment, software for carrying out the above-discussed steps may be stored and distributed on a CD-ROM or DVD 1916, a USB storage device 1918 or other form of media capable of portably storing information. These storage media may be inserted into, and read by, devices such as CD-ROM drive 1914, disk drive 1912, etc. Server 1901 may be coupled to a display 1920, which may be any type of known display or presentation screen, such as LCD, plasma display, cathode ray tube (CRT), etc. A user input interface 1922 is provided, including one or more user interface mechanisms such as a mouse, keyboard, microphone, touchpad, touch screen, voice-recognition system, etc.
[0081] Server 1901 may be coupled to other devices, such as base stations or other communication stations, etc. The server may be part of a larger network configuration as in a global area network (GAN) such as the Internet 1928, which allows ultimate connection to various landline and/or mobile computing devices. [0082] The disclosed embodiments provide a communication system enhanced with DL-based CV system, which is capable of monitoring the various user terminals and adjusting the direction and/or power of the beams sent by the base station. It should be understood that this description is not intended to limit the invention. On the contrary, the embodiments are intended to cover alternatives, modifications and equivalents, which are included in the spirit and scope of the invention as defined by the appended claims. Further, in the detailed description of the embodiments, numerous specific details are set forth in order to provide a comprehensive understanding of the claimed invention. However, one skilled in the art would understand that various embodiments may be practiced without such specific details. [0083] Although the features and elements of the present embodiments are described in the embodiments in particular combinations, each feature or element can be used alone without the other features and elements of the embodiments or in various combinations with or without other features and elements disclosed herein. [0084] This written description uses examples of the subject matter disclosed to enable any person skilled in the art to practice the same, including making and using any devices or systems and performing any incorporated methods. The patentable scope of the subject matter is defined by the claims, and may include other examples that occur to those skilled in the art. Such other examples are intended to be within the scope of the claims.
References
The entire content of all the publications listed herein is incorporated by reference in this patent application.
[1] M. Alrabeiah, A. Hredzak, Z. Liu, and A. Alkhateeb, “ViWi: A deep learning dataset framework for vision-aided wireless communications,” 2019. [Online]
Available: arXiv:1911.06257.
[2] M. Alrabeiah, J. Booth, A. Hredzak, and A. Alkhateeb, "ViWi vision-aided mmWave beam tracking: Dataset, task, and baseline solutions," 2020. [Online] Available: arXiv:2002.02445.
[3] A. Klautau, P. Batista, N. Gonzalez-Prelcic, Y. Wang, and R. W. Heath, “5G MIMO data for machine learning: Application to beam-selection using deep learning,” in Proc. Inf. Theory Appl. Workshop (ITA), 2018, pp. 1-9.
[4] W. Xu, F. Gao, S. Jin, and A. Alkhateeb, “3D scene based beam selection for mmWave communications,” 2019. [Online] Available: arXiv:1911.08409.
[5] M. Alrabeiah, A. Hredzak, and A. Alkhateeb, “Millimeter wave base stations with cameras: Vision aided beam and blockage prediction,” 2019. [Online] Available: arXiv:1911.06255.
[6] S. Ayvaşık, H. M. Gürsu, and W. Kellerer, "Veni Vidi Dixi: Reliable wireless communication with depth images," in Proc. 15th Int. Conf. Emerg. Netw. Exp. Technol., 2019, pp. 172-185.
[7] T. Nishio et al., “Proactive received power prediction using machine learning and depth images for mmWave networks,” IEEE J. Sel. Areas Commun., vol. 37, no. 11, pp. 2413-2427, Nov. 2019.
[8] Y. Koda, K. Nakashima, K. Yamamoto, T. Nishio, and M. Morikura, "Handover management for mmWave networks with proactive performance prediction using camera images and deep reinforcement learning," IEEE Trans. Cogn. Commun. Netw., vol. 6, no. 2, pp. 802-816, Jun. 2020.
[9] Y. Koda et al., "Communication-efficient multimodal split learning for mmWave received power prediction," IEEE Commun. Lett., vol. 24, no. 6, pp. 1284-1288, Jun. 2020.
[10] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Comput, vol. 9, no. 8, pp. 1735-1780, 1997.
[11] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. Cambridge, MA,
USA: MIT Press, 2016.

Claims

WHAT IS CLAIMED IS:
1. A method for calculating beam indices in a wireless communication system, the method comprising: receiving (1800) n images of a target user terminal (180-2) and objects (182, 204) around the target user terminal (180-2), where n is a positive integer; extracting (1802) visual features (303) and motion features (305) associated with the target user terminal (180-2) and the objects (182, 204) around the target user terminal (180-2); generating (1804), in a given cell (800-I) of a deep-learning network (300), an output vector Oj based on (1) merged features (320), obtained from the visual features (303) and the motion features (305), (2) an initial state (610) received from a previous cell (800-(l-1)) of the deep-learning network (300), and (3) a beam index bi, wherein j = i + 1 and i and j are positive integers; and calculating (1806) a future beam index lm for the target user terminal (180-2) for a future time, wherein m is a positive integer, wherein the future beam index lm is calculated to avoid obstacles between an antenna (166) of the wireless communication system (160) and the target user terminal (180-2) at the future time.
2. The method of Claim 1 , further comprising: applying the output vector Oj to an argmax layer to calculate the future beam index lm, only for j equal to or larger than n.
3. The method of Claim 2, further comprising: directing a future beam of the wireless communication system along a direction embedded into the future beam index Im.
4. The method of Claim 1, wherein n = 8, i takes a value between 1 and 8, m takes a value between 9 and 13, and j takes a value between 2 and 13.
5. The method of Claim 1, wherein i takes a value up to and including n, and m takes a value between n+1 and any positive integer larger than n+1.
6. The method of Claim 1, wherein the visual features include at least one of a position of the target user terminal, positions of other user terminals, a number of the other user terminals, a position of the objects around the target user terminal, a height of the objects around the target user terminal.
7. The method of Claim 6, wherein the motion features includes at least one of a speed or acceleration or direction of the target user terminal, the other user terminals, and the objects around the target user terminal.
8. The method of Claim 1, wherein the deep-learning network includes plural given recurrent neural network cells, the first n-1 given recurrent neural network cells being configured to update a hidden state of the deep-learning network and the remaining given recurrent neural network cells being configured to generate the future beam indices lm.
9. The method of Claim 1, wherein the deep-learning network includes plural given recurrent neural network cells forming a one-dimensional network.
10. The method of Claim 1, wherein the deep-learning network includes plural given recurrent neural network cells forming a two-dimensional network.
11. The method of Claim 1, further comprising: calculating a power level for the future beam index lm based on the merged features.
12. A computing device (1900) for calculating beam indices in a wireless communication system, the computing device (1900) comprising: an interface (1910) configured to receive (1800) n images of a target user terminal (180-2) and objects (182, 204) around the target user terminal (180-2), where n is a positive integer; and a processor (1902) connected to the interface (1910) and configured to, extract (1802) visual features (303) and motion features (305) associated with the target user terminal (180-2) and the objects (182, 204) around the target user terminal (180-2); generate (1804), in a given cell (800-I) of a deep-learning network (300), an output vector Oj based on (1) merged features (320), obtained from the visual features (303) and the motion features (305), (2) an initial state (610) received from a previous cell (800-(l-1)) of the deep-learning network (300), and (3) a beam index bj, wherein j = i + 1 and i and j are positive integers; and calculate (1806) a future beam index lm for the target user terminal (180-2) for a future time, wherein m is a positive integer, wherein the future beam index lm is calculated to avoid obstacles between an antenna (166) of the wireless communication system (160) and target user terminal (180-2) at the future time.
13. The computing device of Claim 12, wherein the processor is further configured to: apply the output vector Oj to an argmax layer to calculate the future beam index lm, only for j equal to or larger than n.
14. The computing device of Claim 13, wherein the processor is further configured to: direct a future beam of the wireless communication network along a direction embedded into the future beam index lm.
15. The computing device of Claim 12, wherein n = 8, i takes a value between 1 and 8, m takes a value between 9 and 13, and j takes a value between 2 and 13.
16. The computing device of Claim 12, wherein the visual features include at least one of a position of the target user terminal, positions of other user terminals, a number of the other user terminals, a position of the objects around the target user terminal, a height of the objects around the target user terminal.
17. The computing device of Claim 16, wherein the motion features includes at least one of a speed or acceleration or direction of the target user terminal, the other user terminals, and the objects around the target user terminal.
18. The computing device of Claim 12, wherein the deep-learning network includes plural given recurrent neural network cells, the first n-1 given recurrent neural network cells being configured to update a hidden state of the deep-learning network and the remaining given recurrent neural network cells being configured to generate the future beam indices lm.
19. The computing device of Claim 12, wherein the processor is further configured to: calculate a power level for the future beam index lm based on the merged features.
20. A non-transitory computer readable medium including computer executable instructions, wherein the instructions, when executed by a computer, implement a method for calculating beam indices in a wireless communication system, the method comprising: receiving (1800) n images of a target user terminal (180-2) and objects (182, 204) around the target user terminal (180-2), where n is a positive integer; extracting (1802) visual features (303) and motion features (305) associated with the target user terminal (180-2) and the objects (182, 204) around the target user terminal (180-2); generating (1804), in a given recurrent neural network cell (800-I) of a deep-learning network (300), an output vector oj based on (1) merged features (320), obtained from the visual features (303) and the motion features (305), (2) an initial state (610) received from a previous recurrent neural network cell (800-(l-1)) of the deep-learning network (300), and (3) a beam index bi, wherein j = i + 1 and i and j are positive integers; and calculating (1806) a future beam index lm for the target user terminal (180-2) for a future time, wherein m is a positive integer, wherein the future beam index lm is calculated to avoid obstacles between an antenna (166) of the wireless communication system (160) and target user terminal
(180-2) at the future time.
PCT/IB2021/055268 2020-06-16 2021-06-15 Deep-learning-based computer vision method and system for beam forming WO2021255640A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US202063039805P 2020-06-16 2020-06-16
US63/039,805 2020-06-16

Publications (1)

Publication Number Publication Date
WO2021255640A1 true WO2021255640A1 (en) 2021-12-23

Family

ID=76601508

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/IB2021/055268 WO2021255640A1 (en) 2020-06-16 2021-06-15 Deep-learning-based computer vision method and system for beam forming

Country Status (1)

Country Link
WO (1) WO2021255640A1 (en)

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020065384A1 (en) * 2018-09-28 2020-04-02 Telefonaktiebolaget Lm Ericsson (Publ) Systems and methods for video-assisted network operations

Non-Patent Citations (15)

* Cited by examiner, † Cited by third party
Title
A. Klautau, P. Batista, N. González-Prelcic, Y. Wang, and R. W. Heath, "5G MIMO data for machine learning: Application to beam-selection using deep learning," Proc. Inf. Theory Appl. Workshop (ITA), 2018, pp. 1-9
B. Wang et al., "Controllable video captioning with POS sequence guidance based on gated fusion network," Proc. IEEE Int. Conf. Comput. Vis., Seoul, South Korea, 2019, pp. 2641-2650, XP033723570, DOI: 10.1109/ICCV.2019.00273
I. Goodfellow, Y. Bengio, and A. Courville, "Deep Learning," Cambridge, MA, USA: MIT Press, 2016
K. Hara et al., "Can spatiotemporal 3D CNNs retrace the history of 2D CNNs and ImageNet?," Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Salt Lake City, UT, USA, June 2018, pp. 6546-6555, XP033473572, DOI: 10.1109/CVPR.2018.00685
K. He et al., "Deep residual learning for image recognition," Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Las Vegas, NV, USA, June 2016, pp. 770-778, XP055536240, DOI: 10.1109/CVPR.2016.90
M. Alrabeiah, A. Hredzak, and A. Alkhateeb, "Millimeter wave base stations with cameras: Vision aided beam and blockage prediction," arXiv:1911.06255, 2019
M. Alrabeiah, A. Hredzak, Z. Liu, and A. Alkhateeb, "ViWi: A deep learning dataset framework for vision-aided wireless communications," arXiv:1911.06257, 2019
M. Alrabeiah, J. Booth, A. Hredzak, and A. Alkhateeb, "ViWi vision-aided mmWave beam tracking: Dataset, task, and baseline solutions," arXiv:2002.02445, 2020
Muhammad Alrabeiah et al., "ViWi Vision-Aided mmWave Beam Tracking: Dataset, Task, and Baseline Solutions," arXiv.org, Cornell University Library, 201 Olin Library, Cornell University, Ithaca, NY 14853, 6 February 2020, XP081593898 *
S. Ayvaşık, H. M. Gürsu, and W. Kellerer, "Veni Vidi Dixi: Reliable wireless communication with depth images," Proc. 15th Int. Conf. Emerg. Netw. Exp. Technol., 2019, pp. 172-185
S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Comput., vol. 9, no. 8, 1997, pp. 1735-1780
T. Nishio et al., "Proactive received power prediction using machine learning and depth images for mmWave networks," IEEE J. Sel. Areas Commun., vol. 37, no. 11, November 2019, pp. 2413-2427, XP011750616, DOI: 10.1109/JSAC.2019.2933763
W. Xu, F. Gao, S. Jin, and A. Alkhateeb, "3D scene based beam selection for mmWave communications," arXiv:1911.08409, 2019
Y. Koda et al., "Communication-efficient multimodal split learning for mmWave received power prediction," IEEE Commun. Lett., vol. 24, no. 6, June 2020, pp. 1284-1288, XP011792003, DOI: 10.1109/LCOMM.2020.2978824
Y. Koda, K. Nakashima, K. Yamamoto, T. Nishio, and M. Morikura, "Handover management for mmWave networks with proactive performance prediction using camera images and deep reinforcement learning," IEEE Trans. Cogn. Commun. Netw., vol. 6, no. 2, June 2020, pp. 802-816

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023227193A1 (en) * 2022-05-23 2023-11-30 Telefonaktiebolaget Lm Ericsson (Publ) Direction-based communication
CN115426007A (en) * 2022-08-22 2022-12-02 电子科技大学 Intelligent beam alignment method based on deep convolutional neural network
CN115426007B (en) * 2022-08-22 2023-09-01 电子科技大学 Intelligent wave beam alignment method based on deep convolutional neural network
CN115549759A (en) * 2022-09-19 2022-12-30 南京信息工程大学 Unmanned aerial vehicle communication network construction method based on IRS assistance
CN115549759B (en) * 2022-09-19 2023-06-20 南京信息工程大学 Unmanned aerial vehicle communication network construction method based on IRS assistance

Similar Documents

Publication Publication Date Title
Tian et al. Applying deep-learning-based computer vision to wireless communications: Methodologies, opportunities, and challenges
WO2021255640A1 (en) Deep-learning-based computer vision method and system for beam forming
Nishio et al. When wireless communications meet computer vision in beyond 5G
Salehi et al. Deep learning on multimodal sensor data at the wireless edge for vehicular network
Wu et al. Blockage prediction using wireless signatures: Deep learning enables real-world demonstration
Wu et al. LiDAR-aided mobile blockage prediction in real-world millimeter wave systems
Ahn et al. Towards intelligent millimeter and terahertz communication for 6G: Computer vision-aided beamforming
Reus-Muns et al. Deep learning on visual and location data for V2I mmWave beamforming
Xue et al. A survey of beam management for mmWave and THz communications towards 6G
Charan et al. Computer vision aided blockage prediction in real-world millimeter wave deployments
Wu et al. Proactively predicting dynamic 6G link blockages using LiDAR and in-band signatures
Al-Quraan et al. Intelligent beam blockage prediction for seamless connectivity in vision-aided next-generation wireless networks
Raha et al. Segment Anything Model Aided Beam Prediction for the Millimeter Wave Communication
Ohta et al. Point cloud-based proactive link quality prediction for millimeter-wave communications
Mukhtar et al. Machine learning-enabled localization in 5G using LiDAR and RSS data
Salehi et al. Multiverse at the edge: Interacting real world and digital twins for wireless beamforming
Charan et al. Millimeter wave drones with cameras: Computer vision aided wireless beam prediction
Marasinghe et al. LiDAR aided wireless networks - beam prediction for 5G
Al-Quraan et al. A hybrid data manipulation approach for energy and latency-efficient vision-aided UDNs
Charan et al. User identification: A key enabler for multi-user vision-aided communications
Abd Elaziz et al. Evolution toward intelligent communications: Impact of deep learning applications on the future of 6G technology
Charan et al. Camera based mmWave beam prediction: Towards multi-candidate real-world scenarios
Xiang et al. Congestion attack detection in intelligent traffic signal system: Combining empirical and analytical methods
Ahmad et al. Vision-Assisted Beam Prediction for Real World 6G Drone Communication
Lin et al. Multi-Camera View Based Proactive BS Selection and Beam Switching for V2X

Legal Events

Date Code Title Description
121 EP: The EPO has been informed by WIPO that EP was designated in this application

Ref document number: 21734506

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 EP: PCT application non-entry in European phase

Ref document number: 21734506

Country of ref document: EP

Kind code of ref document: A1