WO2023175381A1

WO2023175381A1 - Iterative training of collaborative distributed coded artificial intelligence model

Info

Publication number: WO2023175381A1
Application number: PCT/IB2022/052483
Authority: WO
Inventors: Yuxuan JIANG; Qiang Ye; Emmanuel Thepie FAPI; Wenting Sun; Fudong Li
Original assignee: Telefonaktiebolaget Lm Ericsson (Publ)
Priority date: 2022-03-18
Filing date: 2022-03-18
Publication date: 2023-09-21

Abstract

A method performed by a computing device (101, 1200) for iterative training of a collaborative distributed coded AI model is provided. The method includes receiving (409) a request from the AI model for a redundancy factor for an iteration of training of the AI model. The redundancy factor comprises an amount of workload to be assigned per worker device of a distributed computing cluster in the iteration. The method further includes selecting (411) the redundancy factor in the iteration based on use of a machine learning, ML, model that selects the redundancy factor that has a lowest overall execution time from a set of redundancy factors. The method further includes sending (413) the selected redundancy factor to a master node for a distributed coded execution of a multiplication of a data matrix and a vector in an iteration of the training of the AI model.

Description

ITERATIVE TRAINING OF COLLABORATIVE DISTRIBUTED CODED ARTIFICIAL INTELLIGENCE MODEL

TECHNICAL FIELD

[0001] The present disclosure relates generally to iterative training of a collaborative distributed coded artificial intelligence (Al) model, and related methods and apparatuses.

BACKGROUND

[0002] In the context of collaborative distributed Al/machine learning (ML) (referred to herein as an Al model), a parameter server (e.g., a master node) and multiple Internet of Things (loT) edge devices (also referred to as worker devices) cooperatively work to complete AI/ML training. See e.g., W. Y. B. Lim et al., "Incentive mechanism design for resource sharing in collaborative learning", arXiv preprint arXiv:2016.00511, https://doi.org/10.48550/arXiv.2006.00511 (2020) (accessed on 14 March 2022).

[0003] Considering a trusted cluster of loT devices, a training dataset can be shared across the edge devices. The master node trains the Al model based on the data the master node has collected. The Al model training is assumed to converge after a certain number of iterations. In collaborative distributed Al model training, and in ideal conditions of communication and computation, each loT device is responsible of performing a portion of processing during each training iteration. The loT devices output results are then sent back to the master node for aggregation or combination.

SUMMARY

[0004] In collaborative distributed Al model training, some worker devices (e.g., loT edge devices) in a distributed computing cluster may not be reliable in computation (e.g., computing processing unit (CPU) overloaded, system failure, etc.) and communications (e.g., limited communication bandwidth, increased latency, etc.), especially in wireless communication. Thus, collaborative distributed Al learning may become challenging. Moreover, worker devices (e.g., loT edge devices) in a low coverage zone may affect such a training process as data may arrive late. Such a scenario may not be suitable for realtime applications. [0005] Worker devices may become stragglers and, thus, may have an effect of delaying the learning process. Some approaches using a distributed coded Al strategy may lack intelligence and/or an online decision during the dispatching of workload to each worker device and/or in the collection of output results by a central coordinator node. [0006] Potential advantages provided by various embodiments of the present disclosure may include that the method includes operations that may perform an online workload allocation decision to execute an iterative Al model using distributed coded Al model training. As a consequence, workloads may be intelligently assigned across worker devices in each iteration and latency may be reduced or minimized.

[0007] In various embodiments, a method performed by a computing device is provided for iterative training of a collaborative distributed coded Al model. The method includes receiving a request from the Al model for a redundancy factor for an iteration of training of the Al model. The redundancy factor comprises an amount of workload to be assigned per worker device of a distributed computing cluster in the iteration. The method further includes selecting the redundancy factor in the iteration based on use of a machine learning, ML, model that selects the redundancy factor that has a lowest overall execution time from a set of redundancy factors. The method further includes sending the selected redundancy factor to a master node for a distributed coded execution of a multiplication of a data matrix and a vector in an iteration of the training of the Al model.

[0008] In various embodiments, a computing device is provided. The computing device includes processing circuitry, and at least one memory coupled with the processing circuitry. The memory stores program code that is executed by the processing circuitry to perform operations. The operations include receive a request from the Al model for a redundancy factor for an iteration of training of the Al model. The redundancy factor comprises an amount of workload to be assigned per worker device of a distributed computing cluster in the iteration. The operations further include select the redundancy factor in the iteration based on use of a ML model that selects the redundancy factor that has a lowest overall execution time from a set of redundancy factors. The operations further include send the selected redundancy factor to a master node for a distributed coded execution of a multiplication of a data matrix and a vector in an iteration of the training of the Al model. [0009] In various embodiments, a computing device is provided that is adapted to perform operations comprising receive a request from the Al model for a redundancy factor for an iteration of training of the Al model. The redundancy factor comprises an amount of workload to be assigned per worker device of a distributed computing cluster in the iteration. The operations further include select the redundancy factor in the iteration based on use of a ML model that selects the redundancy factor that has a lowest overall execution time from a set of redundancy factors. The operations further include send the selected redundancy factor to a master node for a distributed coded execution of a multiplication of a data matrix and a vector in an iteration of the training of the Al model. [0010] In various embodiments, a computer program product including a non- transitory storage medium including program code to be executed by processing circuitry of a computing device is provided. Execution of the program code causes the computing device to perform operations comprising receive a request from the Al model for a redundancy factor for an iteration of training of the Al model. The redundancy factor comprises an amount of workload to be assigned per worker device of a distributed computing cluster in the iteration. The operations further include select the redundancy factor in the iteration based on use of a ML model that selects the redundancy factor that has a lowest overall execution time from a set of redundancy factors. The operations further include send the selected redundancy factor to a master node for a distributed coded execution of a multiplication of a data matrix and a vector in an iteration of the training of the Al model.

[0011] In various embodiments, a computer program including program code to be executed by processing circuitry of a computing device is provided. The program code causes the computing device to perform operations comprising receive a request from the Al model for a redundancy factor for an iteration of training of the Al model. The redundancy factor comprises an amount of workload to be assigned per worker device of a distributed computing cluster in the iteration. The operations further include select the redundancy factor in the iteration based on use of a ML model that selects the redundancy factor that has a lowest overall execution time from a set of redundancy factors. The operations further include send the selected redundancy factor to a master node for a distributed coded execution of a multiplication of a data matrix and a vector in an iteration of the training of the Al model.

[0012] In various embodiments, a method performed by a master node in a distributed computing cluster for iterative training of a collaborative distributed coded Al model is provided for iterative training of a collaborative distributed coded Al model. The method includes receiving, from a computing device, a redundancy factor. The redundancy factor comprises an amount of workload to be assigned per worker device of a distributed computing cluster in an iteration. The method further includes receiving a data matrix and a vector from the Al model; encoding the data matrix into a plurality of submatrices; distributing a respective submatrix and the vector to a respective worker device in the distributed computing cluster according to the redundancy factor. The method further includes collecting respective results of a distributed coded execution of a multiplication of a respective submatrix and the vector in an iteration of the training of the Al model by respective worker devices in the subset of the worker devices in the distributed computing cluster. The method further includes extracting an overall result from the collected respective results; and sending the overall result to the Al model to determine whether the training is completed.

[0013] In various embodiments, a master node is provided. The master node includes processing circuitry, and at least one memory coupled with the processing circuitry. The memory stores program code that is executed by the processing circuitry to perform operations. The operations include receive, from a computing device, a redundancy factor. The redundancy factor comprises an amount of workload to be assigned per worker device of a distributed computing cluster in an iteration. The operations further include receive a data matrix and a vector from the Al model; encode the data matrix into a plurality of submatrices; distribute a respective submatrix and the vector to a respective worker device in the distributed computing cluster according to the redundancy factor. The method further includes collect respective results of a distributed coded execution of a multiplication of a respective submatrix and the vector in an iteration of the training of the Al model by respective worker devices in the subset of the worker devices in the distributed computing cluster. The method further includes extract an overall result from the collected respective results; and send the overall result to the Al model to determine whether the training is completed.

[0014] In various embodiments, a master node is provided that is adapted to perform operations comprising receive, from a computing device, a redundancy factor.

The redundancy factor comprises an amount of workload to be assigned per worker device of a distributed computing cluster in an iteration. The operations further include receive a data matrix and a vector from the Al model; encode the data matrix into a plurality of submatrices; distribute a respective submatrix and the vector to a respective worker device in the distributed computing cluster according to the redundancy factor. The method further includes collect respective results of a distributed coded execution of a multiplication of a respective submatrix and the vector in an iteration of the training of the Al model by respective worker devices in the subset of the worker devices in the distributed computing cluster. The method further includes extract an overall result from the collected respective results; and send the overall result to the Al model to determine whether the training is completed.

[0015] In various embodiments, a computer program product including a non- transitory storage medium including program code to be executed by processing circuitry of a master node is provided. Execution of the program code causes the master node to perform operations comprising receive, from a computing device, a redundancy factor.

The redundancy factor comprises an amount of workload to be assigned per worker device of a distributed computing cluster in an iteration. The operations further include receive a data matrix and a vector from the Al model; encode the data matrix into a plurality of submatrices; distribute a respective submatrix and the vector to a respective worker device in the distributed computing cluster according to the redundancy factor. The method further includes collect respective results of a distributed coded execution of a multiplication of a respective submatrix and the vector in an iteration of the training of the Al model by respective worker devices in the subset of the worker devices in the distributed computing cluster. The method further includes extract an overall result from the collected respective results; and send the overall result to the Al model to determine whether the training is completed. [0016] In various embodiments, a computer program including program code to be executed by processing circuitry of a master node is provided. The program code causes the master node to perform operations comprising receive, from a computing device, a redundancy factor. The redundancy factor comprises an amount of workload to be assigned per worker device of a distributed computing cluster in an iteration. The operations further include receive a data matrix and a vector from the Al model; encode the data matrix into a plurality of submatrices; distribute a respective submatrix and the vector to a respective worker device in the distributed computing cluster according to the redundancy factor. The method further includes collect respective results of a distributed coded execution of a multiplication of a respective submatrix and the vector in an iteration of the training of the Al model by respective worker devices in the subset of the worker devices in the distributed computing cluster. The method further includes extract an overall result from the collected respective results; and send the overall result to the Al model to determine whether the training is completed.

BRIEF DESCRIPTION OF DRAWINGS

[0017] The accompanying drawings, which are included to provide a further understanding of the disclosure and are incorporated in and constitute a part of this application, illustrate certain non-limiting embodiments of inventive concepts. In the drawings:

[0018] Figure 1 is a schematic diagram of an overview of distributed coded Al model learning in accordance with some embodiments of the present disclosure;

[0019] Figure 2 is a schematic diagram illustrating a cloud-based implementation as a service in accordance with some embodiments of the present disclosure;

[0020] Figure 3 is a signalling diagram in accordance with some embodiments of the present disclosure;

[0021] Figure 4 is a flow chart illustrating operations of a computing device in accordance with some embodiments of the present disclosure;

[0022] Figure 5 is a flow chart illustrating operations of a master node in accordance with some embodiments of the present disclosure; [0023] Figure 6 is a plot of empirical average execution time for each arm for a simulation in accordance with some embodiments of the present disclosure;

[0024] Figures 7A-7D are plots of per-iteration reward evolution for the simulation in accordance with some embodiments of the present disclosure;

[0025] Figures 8-11 are plots of numbers of pulls for each arm in the simulation in accordance with some embodiments of the present disclosure;

[0026] Figure 12 is a block diagram of a computing device in accordance with some embodiments of the present disclosure;

[0027] Figure 13 is a block diagram of a master node in accordance with some embodiments of the present disclosure; and

[0028] Figure 14 is a block diagram of a worker device in accordance with some embodiments of the present disclosure.

DETAILED DESCRIPTION

[0029] Inventive concepts will now be described more fully hereinafter with reference to the accompanying drawings, in which examples of embodiments of inventive concepts are shown. Inventive concepts may, however, be embodied in many different forms and should not be construed as limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of present inventive concepts to those skilled in the art. It should also be noted that these embodiments are not mutually exclusive. Components from one embodiment may be tacitly assumed to be present/used in another embodiment.

[0030] The following description presents various embodiments of the disclosed subject matter. These embodiments are presented as teaching examples and are not to be construed as limiting the scope of the disclosed subject matter. For example, certain details of the described embodiments may be modified, omitted, or expanded upon without departing from the scope of the described subject matter.

[0031] The following explanation of potential problems with some approaches is a present realization as part of the present disclosure and is not to be construed as previously known by others. [0032] As previously referenced, in collaborative distributed Al model training, some loT edge devices in a cluster may not be reliable in computation (e.g., CPU overloaded, system failure, etc.) and communications (e.g., limited communication bandwidth, increased latency, etc.), especially in wireless communication. Due to a heterogeneous and time-varying nature of loT edge devices availability, collaborative distributed Al learning may become challenging.

[0033] Some computing, e.g., edge computing, includes distributed computing and data storage for services with low latency requirements to help enable ultra-fast interactions and/or responsiveness. Resources may be unbalanced in an edge computing scenario and edge devices may be located in different fifth generation (5G) coverage zones, such as high, low, or medium coverage zones. Latency for edge devices in a low coverage zone may be higher than latency for edge devices in a high coverage zone.

Thus, edge devices in the low coverage zone may affect the training process as data may arrive late. Such a scenario may not be suitable for real-time applications.

[0034] Some approaches have used distributed coded techniques to address such a scenario. These techniques may allow injection of erasure and error-correcting codes to improve the reliability via coded computation. This injection is achieved by intelligently adding some redundancy to the data assigned to loT edge devices for a subtask in each iteration.

[0035] In an effort to mitigate the effects of possible straggler worker devices (e.g., due to limited communication bandwidth and/or increased latency), some distributed coded Al approaches can increase the computational workload overhead assigned to the worker devices. Some additional challenges associated with increased workload may include system disturbances, such as slow-down or failures of an individual worker device(s).

[0036] As a consequence, worker devices exposed to such issues may become stragglers and, thus, may have an effect of delaying the learning process. Such worker devices may also slow down the time to achieve convergence, lower the accuracy, and may lead to challenging analysis and debugging. Benefits of a distributed coded Al strategy over uncoded implementation, however, may be restricted by lack of intelligence and lack of an online decision during the dispatching of workload to each worker device and/or in the collection of the output results by the central coordinator or the master node.

[0037] For example, in each iteration, a data matrix-vector multiplication y = A ■ x is computed. To complete an Al learning task as soon as possible, it may be preferable to minimize the long-term computation over a certain number of iterations. If, e.g., 5,000 iterations are needed to ensure the convergence with each computation of the data matrix-vector multiplication, a target may be to minimize the overall training time over these 5,000 iterations by appropriately using the distributed coded Al cluster at the network edge. Thus, it may be desirable to inject an appropriate amount of redundancy to the computation at the worker device(s) in each iteration.

[0038] To determine the redundancy injected to the assigned workloads in distributed coded Al training, some approaches may exploit an offline model-based approach. Such approaches may assume that the worker device's communication and computation capabilities follow certain distributions or probability, such as exponential distributions. Such approaches also may assume that the parameters of the distributions are exactly known in advance, which may be unrealistic for real-word systems and applications.

[0039] For example, K. Lee et al., "Speeding up distributed machine learning using codes," IEEE Transaction on Information theory, Vol. 64, no. 3, pp. 1514-1529 (2017) considers a homogeneous cluster with Maximum Distance Separable (MDS) codes to conduct matrix-vector multiplication. D. Kim et al., "Optimal load allocation for coded distributed computation in heterogeneous clusters," IEEE Transaction on Communications, vol. 69, no. 1, pp. 44-58 (2021) considers a heterogeneous cluster with MDS code for matrix-vector multiplication. A. Reisizadeh et al., "Coded computation over heterogeneous cluster," IEEE Transaction on Information Theory, vol. 65, no. 7, pp. 4227- 4242 (2019) considers a heterogeneous cluster with Random Linear Codes (RLC) for matrix-multiplication.

[0040] Such approaches assume that the workload execution time on each worker device follows an exponential distribution, whose parameters are exactly known. The approaches derived a closed-form expression of the amount of redundancy to be injected to the worker devices, which may not be (e.g., cannot be) achieved in real applications. [0041] In another approach, Wenchao Xia et al., "Multi-Armed Bandit-Based Client Scheduling for Federated Learning," IEEE Transactions on Wireless Communications, vol. 19, no. 11, pp. 7108-7123 (Nov. 2020), doi: 10.1109/TWC.2020.3008091 ("Xia") experimented with a Multi-Armed Bandit algorithm in a different context. Xia discusses the effectiveness of the framework for online client scheduling in Federated Leaning without knowing wireless channel state information and statistical characteristics of clients.

[0042] Various embodiments of the present disclosure may provide solutions to these and other potential problems. A method is provided for systems (e.g., large-scale systems) where collaborative distributed Al model learning performance may need to be robust against disturbances such as straggler worker devices, system failures, communication issues, etc. In each iteration of a collaborative distributed learning Al model, the method includes a data matrix-vector multiplication as part of a building block. The data matrix-vector multiplication is computed at a computing device (e.g., a computing device (which, in some embodiments, may be a master node) based on the outputs of worker devices in a distributed computing cluster.

[0043] The Al model may include one of independent component analysis (ICA), principal component analysis (PCA), a convolutional neural network (CNN), and a deep neural network (DNN). Additionally, linear transformation may be included in signal processing, including any iterative intensive computation as class of processing .

[0044] Potential technical advantages provided by various embodiments of the present disclosure may include that based on the coded distributed Al model deciding an amount of redundancy to be injected in each iteration, in real time applications, the method may reduce effects of disturbances when, instead, an assumption on a worker device's capability is used. Additionally, when a MAB based decision framework is included in the method, the decision may be a model free plug-and-play (e.g., online, real-time) decision on the amount of redundancy to be injected in each iteration of the distributed coded Al model training. As a consequence, the method may allow selection of a reliable subset of worker devices for a real-world distributed coded AL model training system such as to minimize the training time in each iteration; and the online decision framework may help make online workload allocation decisions to execute the iterative Al model using distributed coded Al training. [0045] Additional potential technical advantages based on the method deciding the amount of workload to be assigned to respective worker devices may include the following:

[0046] Avoidance of lost Al model updates and minimize the latency: If workload is not intelligently assigned across worker devices in each iteration, the overall training may be affected by delay caused by overloaded processing of some worker devices. Thus, model aggregation or reconstruction may not be achieved on time. Application performance also may decrease and may not fit a real-time process. The method may reduce (e.g., significantly reduce) the convergence time of the Al model as the master node does not need to wait for the slowest worker device(s) responses.

[0047] Model Free Approach: Deployment of the method is not restricted to a particular Al model. The method may be suitable for real-word application where most of the processes are stochastic. In the real-word, worker devices can dynamically join and leave the computing cluster. When a worker device joins the computing cluster, it may not be practical to require the worker device to report its parameters (e.g., communication, computation capability, and reliability).

[0048] Online Decision as Plug-and-Play: The method may be more practical than some approaches discussed herein based on a master node treating the performance of worker devices as a black box. As soon a distributed computing cluster is formed, the master node (also referred to herein as a central coordinator) may start to work with the distributed computing cluster and figure out the worker devices' capability and reliability by itself. The redundancy factor in each iteration is updated or chosen from an available set (e.g., using a MAB based framework).

[0049] Self-Organized and Intelligent Central Coordinator: The method may allow deployment of a master node that orchestrates the processing itself without additional external intervention.

[0050] Stragglers Mitigation in Distributed Computing: Based on the method allowing, in each training iteration, an intelligent workload allocation and data collection communication bottlenecks, system disturbances, and node failures in distributed Al model training may be efficiently addressed. [0051] Reduction of Energy Consumption: Upon reception of a number of outputs (e.g., decided by an encoding algorithm and MAB arm), the master node does not need to wait. As a consequence, the computing capacity of the worker devices may be optimally used and, thus, power consumption may be reduced.

[0052] Cloud based implementation as a service: The method may be generalized to various types of collaborative computing applications as a service. Thus, a subscriber of such a service may see its master node assisted by the method (e.g., especially for latency critical applications).

[0053] Real-Time Applications: The method may include an online decision. As such, the method may be suitable for real-time applications (e.g., where latency and task scheduling are critical).

[0054] Generalization and Application Specific: The method may be deployed based on any type of existing MAB algorithm. Thus, for each application and depending on resources, the method may be generalized to any type of application and any type of MAB algorithm.

[0055] Embodiments of the present disclosure include a homogeneous computing cluster of worker devices. The worker devices have the same statistical characteristics in terms of their computation capability and reliabilities; and these worker devices remain in the cluster during the iterative Al learning process.

[0056] Figure 1 is schematic diagram illustrating an overview of distributed coded Al model learning with a homogeneous distributed computing cluster of worker devices 103a. . . 103n executing an iterative Al model in accordance with some embodiments of the present disclosure. Al model 105 is iteratively trained based in a data matrix-vector multiplication.

[0057] As illustrated in Figure 1, N homogeneous worker devices 103 are connected to a backend master node 101 (also referred to as a central coordinator 101). Master node 101 also may be a computing device for the method, as discussed further herein. Master node 101 and the worker devices 103a. . . 103n form a distributed computing cluster. Iterative Al model 105 is run on the distributed computing cluster. In each iteration, labeled "m”, a data matrix-vector multiplication y = A ■ x is computed, where y is a result, A is a data matrix having co rows and b columns, x is a vector, y e fRL^Aco, l e ]R^A (&J ■ b), and x e ]R^AZ). In each iteration, a> and b remain unchanged, k is a redundancy factor (also referred to as a recovery threshold) that is a decision variable per iteration of training of Al model 105. The redundancy factor k identifies an amount of workload to be assigned per worker device 103 of the distributed computing cluster in the iteration.

[0058] MDS coding is a widely adopted linear block coding technique. See e.g., K. Lee et al., "Speeding up distributed machine learning using codes", IEEE Transaction on Information theory, Vol. 64, no. 3, pp. 1514-1529 (2017); N. Ding et al, "Optimal incentive and load design for distributed coded machine learning", IEEE Journal on Selected Areas in Communications, Vol. 39, no. 7, pp. 2090-2104 (2021) ("Ding"); R. Singleton, "Maximum distance Q-nary codes", IEEE Transactions on Information Theory, vol. 10, no. 2, pp. 116-118 (1964). Given the redundancy factor k, where 1 < k < N, k G Z, a submatrix A_n G ]R^A(Z ■ b) is assigned to worker device 103 n E N = {1, 2, 3, . . , N}, where I = a>/k and I G Z. The A„ submatrices may be generated by coding theoretic techniques.

[0059] Still referring to Figure 1, in order for master node 101 to recover the computation result y, master node 101 collects local computation results from a subset of the N worker devices.

[0060] Considering an index set of the iterations M={1,2,3,..,M}, and assuming that the iterative Al model 105 takes M iterations to converge, an objective of the method of the present disclosure may be to minimize the overall execution time of this learning by determining an appropriate k value in each iteration m EM.

[0061] A reliability-workload trade-off of the determined value of k may be illustrated as follows: A smaller k value translates to a smaller number of local work devices' 103 computation results needed to construct a final, overall computation y. Thus, a higher system reliability may result as the master node 101 relies on a fewer number of worker devices 103. Thus, master node 101 may tolerate a larger number of malfunctioning worker devices 103. However, a smaller value k also leads to a larger I value and a higher row dimension in the submatrix A„ that is assigned to each worker device 103. Overall, each worker device, thus, may need to tackle a higher computation workload. An appropriate k value may balance the reliability-workload trade-off. [0062] Figure 2 is schematic diagram illustrating a cloud-based implementation as a service in accordance with some embodiments of the present disclosure. The cloud-based infrastructure may include a cloud based computing device 201 communicatively connected a network (e.g., 5G network 203). The network includes communication connections to a plurality of distributed computing clusters 207a/205a - 207n/205n, each including a master node 207 and a plurality of worker devices 205 (one or more of which may be generally referred to as work device 205). The implementation includes a trusted edge cloud where security and confidentiality are included.

[0063] In the example of Figure 2, the cloud-based implementation includes 5G network 203 that includes an access network, such as a radio access network (RAN) and a core network (not illustrated) which includes one or more core network nodes. The access network may include one or more access network nodes, such as master nodes 207a. . . 207n (one or more of which may be generally referred to as master node 207), or any other similar 3rd Generation Partnership Project (3GPP) access node or non-3GPP access point. The master nodes 207 facilitate direct or indirect connection of worker devices 205, such as by connecting worker devices 205 to the 5G network 203 over one or more wireless connections.

[0064] Example wireless communications over a wireless connection include transmitting and/or receiving wireless signals using electromagnetic waves, radio waves, infrared waves, and/or other types of signals suitable for conveying information without the use of wires, cables, or other material conductors. Moreover, in different embodiments, the cloud-based implementation may include any number of wired or wireless networks, master nodes, computing devices, worker devices, and/or any other components or systems that may facilitate or participate in the communication of data and/or signals whether via wired or wireless connections. The cloud-based implementation may include and/or interface with any type of communication, telecommunication, data, cellular, radio network, and/or other similar type of system. [0065] As a whole, the communication systems of Figures 1 and/or 2 enable connectivity between the worker devices, master nodes, and computing devices. In that sense, the communication systems may be configured to operate according to predefined rules or procedures, such as specific standards that include, but are not limited to: Global System for Mobile Communications (GSM); Universal Mobile Telecommunications System (UMTS); Long Term Evolution (LTE), and/or other suitable 2G, 3G, 4G, 5G standards, or any applicable future generation standard (e.g., 6G); wireless local area network (WLAN) standards, such as the Institute of Electrical and Electronics Engineers (IEEE) 802.11 standards (WiFi); and/or any other appropriate wireless communication standard, such as the Worldwide Interoperability for Microwave Access (WiMax), Bluetooth, Z-Wave, Near Field Communication (NFC) ZigBee, LiFi, and/or any low-power wide-area network (LPWAN) standards such as LoRa and Sigfox.

[0066] As discussed previously herein, a MAB model may be used in the method of the present disclosure. A MAB may be used to model a tradeoff faced by automated Al model. The MAB may aim to gain new knowledge by exploring its environment, and to exploit its current, reliable knowledge. MAB problems may be a class of partial-information sequential resource allocation problems concerned with allocating between multiple options, where the benefits of each option are not known at the time of allocation. A benefit may be discovered as time passes and resources are reallocated. The name "MAB" refers to a visualization of this problem.

[0067] In a MAB game, the available parameters to the forecaster are the number of arms (or actions) K and the number of rounds n, unknown to the forecaster. A gain vector g_t = g_{l t} g_{k t}) at each round t may be generated as follows:

For each round t=l,2,....,n

• The forecaster chooses an arm l_t E (1, ... , K)

• The forecaster received the gain g_t

• Only g_t is revealed to the forecaster

[0068] A cumulative regret goal may be to maximize the cumulative gains obtained. In an example embodiment, the goal is to minimize: (Equation 1)

Where the expectation E comes from both a possible stochastic generation of the gain vector and a possible randomization in the choice of l_t.

[0069] In a MAB game, the environment is stochastic. The gain vector g_t is sampled from an unknown product distribution x ... .x v_K on [0, 1]^K that is g_{i t} « v_t. Also, the environment is adversarial in the way that the gain vector g_t is chosen by an adversarial (which at time t, knows all the past, but not l_t. There may be variants of the MAB bandit problem, as well as multiple applications.

[0070] With a stochastic MAB bandit game introduced by Robbins, the unknown parameters to the forecaster are the reward distributions Vi, ... , v_K of the arms (with respective mean g ...,g_K). The algorithm may be deployed as follows:

For each round t = 1, 2, ..., n

• The forecaster chooses an arm l_t E (1,

• The environment draws the gain vector g_t = g_liti g_{K t}) according to

• The forecaster receives the gain g_t

Where:

[0071] The cumulative regret may be given by:

(Equation 2)

[0072] The goal may be to minimize the expected cumulative regret:

(Equation 3)

[0073] In some embodiments, the distributed coded Al model training is mapped to a MAB algorithm. Four features may delimit the MAB problem within the general class of stochastic control problems:

• Only one arm operates at each instant time. The evolution of the arm that is being operate is uncontrolled. The forecaster chooses which arm to operate but not how to operate it.

• Arms that are not operated remain frozen

• Arms are independent

• Frozen arms contributed no reward

[0074] In such a problem, there is a set of arms, each of which when played or pulled by the forecaster yields some reward, depending on its internal state which evolve stochastically over time. Such elements put together are suitable for distributed coded Al model training. [0075] Since l=ω /k is integer, all possible selections of k (an arm (or in other words, the redundancy factor)) can be retrieved in a set of K = {k₁ k₂, k₃,.., k₁ }. The index set may be taken as: J = {1, 2, 3, . . , I}. If a particular k_i ∈ K is selected, the overall execution time, from when the submatrices A„ start to be distributed to worker devices to when the master node obtains k_i local computation results from the worker devices is denoted by Φi.

[0076] The randomness of the process may be characterized by the randomness of the computation time of each the randomness of communication time to

dispatch and x, and the upload of y_n. The execution time Φi for one iteration is also a random variable. If the selection of k in iteration m is denoted by then, the

associated execution time for this iteration is given by

[0077] For each iteration m in M, the target of this proposal scheme is to determine in order to:

(Equation 4)

[0078] The formulation in Equation 4 is suitable for an MAB model. In this analogy, % set can be viewed as the set of arms. Pulling an arm is mapped by the selection of from set in iteration m. Doing so, the computation time may

be minimized, which is analogous to get a reward . Finally, the overall target

is to maximize the total reward during the Al model training

process.

[0079] If the master node knows

, with i in J, then the optimal policy for each iteration m is given by:

(Equation 5) (Equation 6)

[0080] In practice, the master node may not have information of either Φ(k_i) or , i in 3 when the distributed computing cluster is just formed before any

computation is executed. The master node, therefore, may allow an initialization step to try different k values and obtain the knowledge of

. This operation of the method be considered as a warm-up phase. Different warm-up phases can be designed according to the type of MAB algorithm to be deployed.

[0081] In some embodiments, the online decision is fit into the MAB algorithm. An online framework may involve sequential decision-making under uncertainty. Thus, in some embodiments, the agent or forecaster is the master node and is initially unaware of the stochastic evolution of the environment (that is, arms/redundancy factor), aims at maximizing a common objective based on the history of actions and observations.

[0082] An assumption may be included that the number of iterations that lead to the Al model convergence is large enough compared to the size of set of redundancy factor to be selected K.

[0083] An example embodiment of the method of the present disclosure is illustrated in the signalling diagram shown in Figure 3. Three components are included in Figure 3: computing device 101 which includes an MAB online decision model, distributed computing cluster 205/207, and iterative Al model 105. Each component receives a set of instructions and performs processing to output some metrics or variables. Sequences of the operations of this example embodiment include the following.

[0084] Step 0: Iterative Al model 105 signals (operation 305) a request to a master node 207 of the distributed computing cluster for a computation in each iteration of a vector-data matrix multiplication y=A-x, where matrix A and data x are available. Matrix A has constant row co and constant column b. The dimensions are w*l for y, w*b for A, and b*l for x.

[0085] Master node 207 forms (operation 303) the distributed computing cluster of available and trusted work devices 205 (e.g., loT edge devices) of size N. Master node 207 also activates or triggers the MAB online decision framework of computing device 101.

[0086] Step 1: Master node 207 passes (operation 305) to the MAB online decision model of computing device 101 the number of rows ω of the matrix and the size of the cluster N. The MAB online decision model prepares (operation 309) a list of redundancy factors K = {k₁, k₂, k₃, .. , k_I} based on ω , which are referred to herein as arms. The MAB online decision model sets (operation 311) all execution time Φ_i (mapped as a reward) and the number of times each arm is selected λ_i to zero. [0087] Step 2: At iteration m, iterative Al model 105 releases (operation 321) the matrix A and the data x and requests (operation 315) the redundancy factor from

computing device 101. Computing device 101 activates (operation 317) the MAB model to select the appropriate \ which is the arm that may lead to minimum execution

time. The MAB online decision model sends (operation 319) the selected redundancy factor to master node 207.

[0088] Step 3: Master node 207 encodes (operation 323) the matrix A into a plurality of submatrices using a coding theoretic technique such as MDS. Master

node 207 multicasts (operation 323) to each worker device 205 the pair

. After computation of the distributed local matrix-vector multiplication at each worker device 205 in the distribution, master node 207 collects (operation 323) results of a

subset of the worker devices 205 in the distribution. Master node 207 decodes (operation 323) the results and extracts the final, overall result y.

[0089] Step 4: Master node 207 passes (operation 329) the final, overall result y to iterative Al model 105. Master node 207 also consolidates (operation 323) the final execution time associated with the redundancy factor and sends (operation

327) it to computing device 101.

[0090] Step 5: Computing device 101 updates (operation 331) the parameters

and λ_i according to the algorithms used.

[0091] Step 6: Iterative Al model 105 verifies (operation 333) according to a criteria whether the Al model 105 has converged: If Al model 105 has converged, the iterative Al training ends and all resources are released (operation 335). If Al model 105 has not converged, the method moves to the next iteration (m + 1) and the method restarts at Step 2.

[0092] Figure 4 is a flowchart illustrating operations of a computing device (e.g., computing device 101) according to some embodiments of the present disclosure. The computing device can be computing device 1200 of Figure 12 (as discussed further herein) that is configured for iterative training of a collaborative distributed coded Al model (e.g., Al model 105). The method includes receiving (409) a request from the Al model for a redundancy factor for an iteration of training of the Al model. The redundancy factor comprises an amount of workload to be assigned per worker device (e.g., worker device 205a) of a distributed computing cluster (e.g., distributed computing cluster 207/205) in the iteration. The method further includes selecting (411) the redundancy factor in the iteration based on use of a ML model that selects the redundancy factor that has a lowest overall execution time from a set of redundancy factors. The method further includes sending (413) the selected redundancy factor to a master node (e.g., master node 207) for a distributed coded execution of a multiplication of a data matrix and a vector in an iteration of the training of the Al model.

[0093] The method may further include receiving (415), from the master node, an overall execution time for the distributed coded execution of the multiplication of the data matrix and the vector in the iteration by a subset of a plurality of worker devices in the distributed computing cluster.

[0094] The use of the ML model may comprise (i) per iteration in a set of iterations, choosing a redundancy factor from the set of redundancy factors, (ii) per iteration in the set of iterations, receiving a reward value for the chosen redundancy factor, and (ii) in the iteration, selecting the redundancy factor from the set of redundancy factors that has a highest reward value. The reward value has an inverse relationship to an overall execution time for the distributed coded execution of the multiplication of the data matrix and the vector in the iteration.

[0095] In some embodiments, the method further includes receiving (401), from the master node, a first parameter defining a size of the distributed computing cluster; receiving (403) from the Al model a second parameter defining a number of rows in the data matrix; and identifying (405) the set of redundancy factors based on the number of rows in the data matrix.

[0096] The method may further include initializing (407) values of a plurality of parameters in the ML model to zero. The plurality of parameters may comprise (i) a number of times that redundancy factors are selected from the set of redundancy factors, and (ii) an average reward value of the selected redundancy factors.

[0097] The method may further include updating (417) the ML model with (i) the number of times that redundancy factors are selected, and (ii) the average reward value of the selected redundancy factors where the average reward value has an inverse relationship with the received overall execution time.

[0098] The selecting (413) may include an online decision that selects the redundancy factor.

[0099] The selected redundancy factor may be suitable for a mission critical operation.

[00100] The ML model may comprise a MAB model.

[00101] The plurality of worker devices may comprise a plurality of Internet of

Things, loT, edge computing devices.

[00102] Operations 401-407 and 415-417 from the flow chart of Figure 4 may be optional with respect to some embodiments of computing devices and related methods. [00103] Figure 5 is a flowchart illustrating operations of a master node (e.g., master node 207) according to some embodiments of the present disclosure. The master node may be master node 1300 of Figure 13 (as discussed further herein). The master node is in a distributed computing cluster for iterative training of a collaborative distributed coded Al model (e.g., Al model 105). The method includes receiving (501), from a computing device (e.g., computing device 101), a redundancy factor. The redundancy factor comprises an amount of workload to be assigned per worker device (e.g., worker device 205a) of the distributed computing cluster in an iteration. The method further includes receiving (503) a data matrix and a vector from the Al model; encoding (505) the data matrix into a plurality of submatrices; and distributing (507) a respective submatrix and the vector to a respective worker device in the distributed computing cluster according to the redundancy factor. The method further includes collecting (509) respective results of a distributed coded execution of a multiplication of a respective submatrix and the vector in an iteration of the training of the Al model by respective worker devices in the subset of the worker devices in the distributed computing cluster; extracting (511) an overall result from the collected respective results; and sending (513) the overall result to the Al model to determine whether the training is completed.

[00104] The method may further include identifying (515) an overall execution time for the distributed coded execution of the multiplication of the respective submatrix and the vector in the iteration of the training of the Al model by the respective worker devices in the subset of worker devices in the distributed computing cluster.

[00105] In some embodiments, the method further includes sending (517) the overall execution time to the computing device.

[00106] Operations 515-517 from the flow chart of Figure 5 may be optional with respect to some embodiments of master nodes and related methods.

[00107] In example embodiments, the MAB online decision model may be, without limitation, an ε-greedy or a upper confidence bound 1 (UCB1) model. A workflow of the e- greedy and/or UCB1 may be summarized in the following algorithm:

Algorithm: Example implementation of MAB inline decision model.

1: λ_i <- 0, i ∈ I

2: for iteration m = 1, 2, ..., / do

3: Select arm k_c_m to initialize this arm: k^m <- k_c_m.

4: Pull arm k^m. The corresponding reward (opposite of execution time) is

5: Update λ_i <— λ_i + 1, <-

6: end for

7: for iteration m = I + 1, I + 2, . . ., M do

8: Make a decision on the arm to be selected according to a certain equation.

Suppose arm k_Ci is to be selected.

9: Select arm: k^m <— k_d.

10: Pull arm k^m. The corresponding reward (opposite of execution time) is

11: Update

12: end for

[00108] In the example embodiments, when a distributed computing cluster is formed in a given iteration of the training, the above algorithm is triggered to make an online decision on the k value in the execution of the iterative Al model, as discussed below for a simulation.

[00109] Simulation Step 0: The simulation includes a distributed computing cluster of N = 500 loT edge devices (i.e., worker devices). The simulation assumed that training of the iterative Al model takes approximatively M = 5000 iterations to converge. It is noted that, in practice, there is no need to specify the number of iterations. The simulation also assumed that the number of iterations up to convergence is much larger than the set of redundancy factors (that is M » I). In the simulation, the set of redundancy factor is I = 23; ω = 10⁷ and b = 15. Then, in each iteration of the Al model training, the master node computes .

[00110] Simulation Step 1: The number of rows ω = 10⁷ of the matrix and the size of the cluster N = 500. The list of redundancy factors, therefore, is given by:

K = {1, 2, 4, 5, 8, 10, 16, 20, 25, 32, 40, 50, 64, 80, 100, 125, 160, 200, 250, 320, 400, 500} The master node triggers the example algorithm above to make an online decision. In the example algorithm, the number of pulls for arm k __c. is denoted by λ_i . In the MAB model, the reward harvested by selecting

. In the example algorithm, after initializing the number of pulls for each arm λ_i as 0 (Line 1), the first I iterations of the example algorithm are used to pull every arm once and get an initial result of its resulting performance (Line 2 to Line 5). Other variants of initialization may be used in this sequence.

[00111] Simulation Step 2: In the example algorithm, the empirical average execution time by selecting arm k __c. is stored as

After pulling each arm once, the main body of the example algorithm is performed for the remaining iterations (Line 6 to Line 10). An arm is selected in each iteration according to a certain equation in Line 7 of the example MAB algorithm.

In e-greedy, a hyperparameter e is included, where 0<e<l. In Line 7 of example algorithm, if e-greedy is implemented, the following approach for arm selection is used: _ _uation

8)

If a UCB1 algorithm is implemented, we first, a priority factor for arm in iteration

m is defined as: (Equation 9)

Where μ^m and σ^m are the mean and standard deviation of the set { and a is the

normalization hyperparameter we have introduced. Then, from the priority factor in the equation immediately above, the arm selection (j) is derived as follows: (Equation 10)

[00112] Simulation Step 3: Evaluation of the execution time. In each iteration of the Al model, the execution time on each loT device t includes three random parts: the distribution time t_d, the local computation time t_c, and the uploading time t_u. See Equation 8.

[00113] In the simulation, the following was used: U = 10⁵, 0 = 10^-5, P_distri ⁼ P_up = 0.25, V_distri = V_distri = 32 · 1.1, where a floating-point number is represented by 32 bits with an extra packet encapsulation overhead of 0.1, P_distri = 200 Mbps and P_up = 25 Mbps. Finally, the overall execution time is given by: t = t_c + t_d + t_u (Equation 11)

The following example embodiment of mapping was used in the simulation to estimate the overall execution time.

[00114] Estimation of the overall execution time: In each iteration of the Al model, the execution time on each loT device t includes three random parts: the distribution time t_d, the local computation time t_c, and the uploading time t_u. See Equation 8

[00115] (Equation 12) and (Equation 13).

In Equation 12, D_distri is the downlink distribution, (/ + 1) · b is the number of floatingpoint elements in A„ and x that need to be distributed to a worker. Also, V_distri is a coefficient that translates a floating number into its corresponding size in a data packet, and β_distri is a discrete random number that represents the number of transmissions required to successfully distribute A„ and x to a worker.

[00116] In the simulation, P_distri was considered to follow a geometric distribution given by: (Equation 14)

Where P_distri is the probability of a successful transmission. See e.g., S. Dhakal et al., Proceeding of IEEE 90^th Vehicular Technology Conference (VTC2019-Fall) (2019), pp. 1-6; "Dhakal") H. Karl and W. Willig, "Protocols and architectures for wireless senso networks", John Wiley & Sons, 2007 ("Karl"). Similar to Equation 13, β_up is the uplink bandwidth, b is the number of floating-point elements in y_n to be uploaded back to the master node, V_up is a coefficient that translated a floating-point number to its corresponding size in a data packet, and /3_up is a discrete random number that represents the number of transmission for successfully uploading y_n to the master node. The random variable /3_up follows a geometric distribution given by:

(Equation 15)

[00117] Idem, P_up is the probability of a successful transmission. See e.g., Dhakal and Karl.

[00118] The compute time t_c includes two components: the computation time t_{c l} and the memory access time t_{c 2}, see e.g., Dhakal, so that: tc = t_c,i + t_C2 (Equation 16)

[00119] The local computation time t_{c l}can be estimated as: t_{c l} = 1 - 0 (Equation 17)

Where 6 is the time required to complete the multiplication of one row in

at an loT edge device. The memory access time is a continuous random variable where its probability density function is given by:

(Equation 18)

Where y = U /I, with U being the memory access rate as described, e.g., Ding; Dhakal; W. Shi et al, "Joint device scheduling and resource allocation for latency constrained wireless federate learning," IEEE Transaction on Wireless Communications vol. 20, no. 1, pp. 453- 467 (2020).

[00120] Simulation results are now discussed, including extrapolation of simulation steps 4-6 from analysis of the simulation results.

[00121] Prior to numerical results of the MAB-based online decision models, the empirical average reward for the arms was observed by pulling each arm k __c. , i E I 1000 times and recording the empirical average reward E pt). According to the law of Large Numbers, the empirical reward after 1000 pulls can be very close to the true average reward E(tp ). Figure 6 is a plot of empirical average execution time for each arm (1000 trial per arm) for the simulation. Figure 6 illustrates that the arm 400 achieved the minimum average execution time. [00122] To validate the effectiveness of the MAB-based online decision model for the simulation, the results for the following MAB online decision models were analyzed:

• e-greedy where e=0.05

• e-greedy where e=0.01

• UCB1, where a=25

• A random benchmark, which selects from the set K uniformly at random in each iteration m e M .

[00123] The following Table 1 lists the average execution time (which is the opposite of average reward) per iteration for the four models:

[00124] As illustrated in Table 1, in the simulation, the UCB1 model achieves the best average execution time (e.g., much better that the random algorithm which did not make an intelligent decision).

[00125] The following Table 2 lists the total number of pulls for the arms in the simulation. The simulation includes 5000 iterations a set of 23 arms. Table 2 illustrates the total numbers of pull for the arms whose k_t values are no larger than 100, and the number of pulls for each individual arm whose k_t is larger than 100:

[00126] The per-iteration reward evolution for the simulation is plotted in Figures 7A-7C versus random (Figure 7D), and the number of pulls for each arm in the simulation is illustrated in the plots of Figures 8-11. Figure 7A is a plot of the simulation results for the e- greedy (e = 0.05) model; Figure 7B is a plot of the simulation results for the e-greedy (e = 0.01) model; Figure 7C is a plot of the simulation results for the UCB1 model; and Figure 7D is a plot of results from a random selection.

[00127] Figure 8 is a plot for the simulation of the number of pulls for each arm for the e-greedy (e = 0.05) model; Figure 9 is a plot for the simulation of the number of pulls for each arm for the e-greedy (e = 0.01) model; Figure 10 is a plot for the simulation of the number of pulls for each arm for the UCB1 model; and Figure 11 is a plot for the random selection of each arm.

[00128] As illustrated in Figures 7A-7D, the UCB1 model of Figure 7C performed the best in the simulation and converged within tens of iterations.

[00129] The e-greedy algorithms of Figures 7A and 7B also converge to a relatively stable arm selection with a similar number of iterations. However, due to the nature of s- greedy, as shown in Equation 8, with a small probability of E, the e-greedy randomly explores arms other that the current best one, which is a reason fluctuations are observed in Figures 7A and 7B, which are contributed by these random explorations.

[00130] In contrast, in the simulation, the UCB1 model of Figure 7C better balanced the trade-off between exploration and exploitation, as a result of the second term in Equation 9. As illustrated in Figure 7C, the UCB1 model stopped exploring the arms earlier and focused on the arms that can produce a high reward (that is, meaning with low execution times).

[00131] The random selection illustrated in Figure 7D does not show convergence and is not suitable for an online decision.

[00132] Figure 12 is a block diagram illustrating elements of a computing device 1200 (also referred to as a central node, a central coordinating node, a server, a base station, gNodeB/gNB, etc.) according to embodiments of inventive concepts. As shown, the computing device may include transceiver circuitry 1201 (also referred to as a transceiver) including a transmitter and a receiver configured to provide uplink and downlink communications with worker devices, other computing devices, etc. The computing device may include network interface circuitry 1207 (also referred to as a network interface) configured to provide communications with worker devices and other computing devices. The computing device may also include processing circuitry 1203 (also referred to as a processor) coupled to the transceiver circuitry, and memory circuitry 1205 (also referred to as memory) coupled to the processing circuitry. The memory circuitry 1205 may include computer readable program code that when executed by the processing circuitry 1203 causes the processing circuitry to perform operations according to embodiments disclosed herein. According to other embodiments, processing circuitry 1203 may be defined to include memory so that a separate memory circuitry is not required. The computing device may include a ML model 1209 (e.g., an MAB model). [00133] As discussed herein, operations of the computing device may be performed by processing circuitry 1203, ML 1209, network interface 1207, and/or transceiver 1201. For example, processing circuitry 1203 may control transceiver 1201 to transmit downlink communications through transceiver 1201 to one or more worker devices and/or master node and/or to receive uplink communications through transceiver 1201 from one or more worker devices and/or master node. Moreover, modules may be stored in memory 1205 and/or ML model 1209, and these modules may provide instructions so that when instructions of a module are executed by processing circuitry 1203, processing circuitry 1203 performs respective operations (e.g., operations discussed herein with respect to example embodiments relating to computing devices). According to some embodiments, computing device 1200 and/or an element(s)/function(s) thereof may be embodied as a virtual node/nodes and/or a virtual machine/machines.

[00134] Figure 13 is a block diagram illustrating elements of a master node 1300 (also referred to as a server, a gNodeB/gNB, base station, etc.) according to embodiments of inventive concepts. As shown, the master node may include transceiver circuitry 1301 (also referred to as a transceiver) including a transmitter and a receiver configured to provide uplink and downlink communications with other computing devices, worker devices, etc. The master node includes network interface circuitry 1307 (also referred to as a network interface) configured to provide communications with other computing devices and worker devices (e.g., with computing devices, etc.). The master node may also include processing circuitry 1303 (also referred to as a processor) coupled to the transceiver circuitry, and memory circuitry 1305 (also referred to as memory) coupled to the processing circuitry. The memory circuitry 1305 may include computer readable program code that when executed by the processing circuitry 1303 causes the processing circuitry to perform operations according to embodiments disclosed herein. According to other embodiments, processing circuitry 1303 may be defined to include memory so that a separate memory circuitry is not required. The master node may include an Al model 1309.

[00135] As discussed herein, operations of the inaccessible computing device may be performed by processing circuitry 1303, Al model 1309, network interface 1307, and/or transceiver 1301. For example, processing circuitry 1303 may control transceiver 1301 to transmit downlink communications through transceiver 1301 to one or more computing devices and/or work devices and/or to receive uplink communications through transceiver 1301 from one or more computing devices and/or worker devices. Moreover, modules may be stored in memory 1305 and/or Al model 1309, and these modules may provide instructions so that when instructions of a module are executed by processing circuitry 1303, processing circuitry 1303 performs respective operations (e.g., operations discussed herein with respect to example embodiments relating to master nodes). According to some embodiments, master node 1300 and/or an element(s)/function(s) thereof may be embodied as a virtual node/nodes and/or a virtual machine/machines.

[00136] Figure 14 is a block diagram illustrating elements of a worker device 1400 according to embodiments of inventive concepts. As shown, the worker device may include transceiver circuitry 1401 (also referred to as a transceiver) including a transmitter and a receiver configured to provide uplink and downlink communications with computing devices, master nodes, other worker devices, etc. The worker device may include network interface circuitry 1407 (also referred to as a network interface) configured to provide communications with computing devices, master nodes and/or worker devices. The worker device may also include processing circuitry 1403 (also referred to as a processor) coupled to the transceiver circuitry, and memory circuitry 1405 (also referred to as memory) coupled to the processing circuitry. The memory circuitry 1405 may include computer readable program code that when executed by the processing circuitry 1403 causes the processing circuitry to perform operations according to embodiments disclosed herein. According to other embodiments, processing circuitry 1403 may be defined to include memory so that a separate memory circuitry is not required.

[00137] As discussed herein, operations of the worker device may be performed by processing circuitry 1403, network interface 1407, and/or transceiver 1401. For example, processing circuitry 1403 may control transceiver 1401 to transmit downlink communications through transceiver 1401 to one or more computing devices, master nodes, worker devices and/or to receive uplink communications through transceiver 1401 from one or more computing devices, master nodes, and/or worker devices. Moreover, modules may be stored in memory 1405, and these modules may provide instructions so that when instructions of a module are executed by processing circuitry 1403, processing circuitry 1403 performs respective operations (e.g., operations discussed herein with respect to example embodiments relating to worker devices). According to some embodiments, worker device 1400 and/or an element(s)/function(s) thereof may be embodied as a virtual node/nodes and/or a virtual machine/machines.

[00138] The worker devices may be any of a wide variety of communication devices, including wireless devices arranged, configured, and/or operable to communicate wirelessly with the computing device 1200, the master node 1300, and other communication devices. Similarly, the master node 1300 and/pr the computing device are arranged, capable, configured, and/or operable to communicate directly or indirectly with the worker devices and/or with other computing devices, master nodes, network nodes or equipment in network to enable and/or provide communications and operations of example embodiments discussed herein with respect to worker devices.

[00139] As used herein, a worker device refers to a device capable, configured, arranged and/or operable to communicate wirelessly with computing devices, master nodes, and/or other worker devices. Examples of a worker device UE include, but are not limited to, a smart phone, mobile phone, cell phone, voice over IP (VoIP) phone, wireless local loop phone, desktop computer, personal digital assistant (PDA), wireless cameras, gaming console or device, music storage device, playback appliance, wearable terminal device, wireless endpoint, mobile station, tablet, laptop, laptop-embedded equipment (LEE), laptop-mounted equipment (LME), smart device, wireless customer-premise equipment (CPE), vehicle-mounted or vehicle embedded/integrated wireless device, etc. Other examples include any UE identified by the 3rd Generation Partnership Project (3GPP), including a narrow band internet of things (NB-loT) user equipment (UE), a machine type communication (MTC) UE, and/or an enhanced MTC (eMTC) UE.

[00140] A worker device may support device-to-device (D2D) communication, for example by implementing a 3GPP standard for sidelink communication, Dedicated Short- Range Communication (DSRC), vehicle-to-vehicle (V2V), vehicle-to-infrastructure (V2I), or vehicle-to-everything (V2X). In other examples, a worker device may not necessarily have a user in the sense of a human user who owns and/or operates the relevant device. Instead, a worker device may represent a device that is intended for sale to, or operation by, a human user but which may not, or which may not initially, be associated with a specific human user (e.g., a smart sprinkler controller). Alternatively, a worker device may represent a device that is not intended for sale to, or operation by, an end user but which may be associated with or operated for the benefit of a user (e.g., a smart power meter). [00141] Although the computing devices, nodes, and worker devices described herein may include the illustrated combination of hardware components, other embodiments may comprise computing devices with different combinations of components. It is to be understood that these computing devices, nodes, and worker devices may comprise any suitable combination of hardware and/or software needed to perform the tasks, features, functions, and methods disclosed herein. Determining, calculating, obtaining or similar operations described herein may be performed by processing circuitry, which may process information by, for example, converting the obtained information into other information, comparing the obtained information or converted information to information stored in the computing device, and/or performing one or more operations based on the obtained information or converted information, and as a result of said processing making a determination. Moreover, while components are depicted as single boxes located within a larger box, or nested within multiple boxes, in practice, computing devices may comprise multiple different physical components that make up a single illustrated component, and functionality may be partitioned between separate components. For example, a communication interface may be configured to include any of the components described herein, and/or the functionality of the components may be partitioned between the processing circuitry and the communication interface. In another example, non-computationally intensive functions of any of such components may be implemented in software or firmware and computationally intensive functions may be implemented in hardware.

[00142] In certain embodiments, some or all of the functionality described herein may be provided by processing circuitry executing instructions stored on in memory, which in certain embodiments may be a computer program product in the form of a non- transitory computer-readable storage medium. In alternative embodiments, some or all of the functionality may be provided by the processing circuitry without executing instructions stored on a separate or discrete device-readable storage medium, such as in a hard-wired manner. In any of those particular embodiments, whether executing instructions stored on a non-transitory computer-readable storage medium or not, the processing circuitry can be configured to perform the described functionality. The benefits provided by such functionality are not limited to the processing circuitry alone or to other components of the computing device, but are enjoyed by the computing device as a whole, and/or by end users and a wireless network generally.

[00143] Further definitions and embodiments are discussed below.

[00144] In the above-description of various embodiments of the present disclosure, it is to be understood that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of present inventive concepts. Unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which present inventive concepts belong. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of this specification and the relevant art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.

[00145] When an element is referred to as being "connected", "coupled", "responsive", or variants thereof to another element, it can be directly connected, coupled, or responsive to the other element or intervening elements may be present. In contrast, when an element is referred to as being "directly connected", "directly coupled", "directly responsive", or variants thereof to another element, there are no intervening elements present. Like numbers refer to like elements throughout. Furthermore, "coupled", "connected", "responsive", or variants thereof as used herein may include wirelessly coupled, connected, or responsive. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. Well-known functions or constructions may not be described in detail for brevity and/or clarity. The term "and/or" (abbreviated "/") includes any and all combinations of one or more of the associated listed items.

[00146] It will be understood that although the terms first, second, third, etc. may be used herein to describe various elements/operations, these elements/operations should not be limited by these terms. These terms are only used to distinguish one element/operation from another element/operation. Thus a first element/operation in some embodiments could be termed a second element/operation in other embodiments without departing from the teachings of present inventive concepts. The same reference numerals or the same reference designators denote the same or similar elements throughout the specification.

[00147] As used herein, the terms "comprise", "comprising", "comprises", "include", "including", "includes", "have", "has", "having", or variants thereof are open-ended, and include one or more stated features, integers, elements, steps, components or functions but does not preclude the presence or addition of one or more other features, integers, elements, steps, components, functions or groups thereof. Furthermore, as used herein, the common abbreviation "e.g.", which derives from the Latin phrase "exempli gratia," may be used to introduce or specify a general example or examples of a previously mentioned item, and is not intended to be limiting of such item. The common abbreviation "i.e.", which derives from the Latin phrase "id est," may be used to specify a particular item from a more general recitation.

[00148] Example embodiments are described herein with reference to block diagrams and/or flowchart illustrations of computer-implemented methods, apparatus (systems and/or devices) and/or computer program products. It is understood that a block of the block diagrams and/or flowchart illustrations, and combinations of blocks in the block diagrams and/or flowchart illustrations, can be implemented by computer program instructions that are performed by one or more computer circuits. These computer program instructions may be provided to a processor circuit of a general purpose computer circuit, special purpose computer circuit, and/or other programmable data processing circuit to produce a machine, such that the instructions, which execute via the processor of the computer and/or other programmable data processing apparatus, transform and control transistors, values stored in memory locations, and other hardware components within such circuitry to implement the functions/acts specified in the block diagrams and/or flowchart block or blocks, and thereby create means (functionality) and/or structure for implementing the functions/acts specified in the block diagrams and/or flowchart block(s).

[00149] These computer program instructions may also be stored in a tangible computer-readable medium that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable medium produce an article of manufacture including instructions which implement the functions/acts specified in the block diagrams and/or flowchart block or blocks. Accordingly, embodiments of present inventive concepts may be embodied in hardware and/or in software (including firmware, resident software, micro-code, etc.) that runs on a processor such as a digital signal processor, which may collectively be referred to as "circuitry," "a module" or variants thereof.

[00150] It should also be noted that in some alternate implementations, the functions/acts noted in the blocks may occur out of the order noted in the flowcharts. For example, two blocks shown in succession may in fact be executed substantially concurrently or the blocks may sometimes be executed in the reverse order, depending upon the functionality/acts involved. Moreover, the functionality of a given block of the flowcharts and/or block diagrams may be separated into multiple blocks and/or the functionality of two or more blocks of the flowcharts and/or block diagrams may be at least partially integrated. Finally, other blocks may be added/inserted between the blocks that are illustrated, and/or blocks/operations may be omitted without departing from the scope of inventive concepts. Moreover, although some of the diagrams include arrows on communication paths to show a primary direction of communication, it is to be understood that communication may occur in the opposite direction to the depicted arrows.

[00151] Many variations and modifications can be made to the embodiments without substantially departing from the principles of the present inventive concepts. All such variations and modifications are intended to be included herein within the scope of present inventive concepts. Accordingly, the above disclosed subject matter is to be considered illustrative, and not restrictive, and the examples of embodiments are intended to cover all such modifications, enhancements, and other embodiments, which fall within the spirit and scope of present inventive concepts. Thus, to the maximum extent allowed by law, the scope of present inventive concepts are to be determined by the broadest permissible interpretation of the present disclosure including the examples of embodiments and their equivalents, and shall not be restricted or limited by the foregoing detailed description.

Claims

CLAIMS:

1. A computer-implemented method performed by a computing device (101, 1200) for iterative training of a collaborative distributed coded artificial intelligence, Al, model, the method comprising: receiving (409) a request from the Al model for a redundancy factor for an iteration of training of the Al model, the redundancy factor comprising an amount of workload to be assigned per worker device of a distributed computing cluster in the iteration; selecting (411) the redundancy factor in the iteration based on use of a machine learning, ML, model that selects the redundancy factor that has a lowest overall execution time from a set of redundancy factors; and sending (413) the selected redundancy factor to a master node for a distributed coded execution of a multiplication of a data matrix and a vector in an iteration of the training of the Al model.

2. The method of Claim 1, further comprising; receiving (415), from the master node, an overall execution time for the distributed coded execution of the multiplication of the data matrix and the vector in the iteration by a subset of a plurality of worker devices in the distributed computing cluster.

3. The method of any of Claims 1 to 2, wherein the use of the ML model comprises (i) per iteration in a set of iterations, choosing a redundancy factor from the set of redundancy factors, (ii) per iteration in the set of iterations, receiving a reward value for the chosen redundancy factor, and (ii) in the iteration, selecting the redundancy factor from the set of redundancy factors that has a highest reward value, where the reward value has an inverse relationship to an overall execution time for the distributed coded execution of the multiplication of the data matrix and the vector in the iteration.

4. The method of any of Claims 1 to 3, further comprising: receiving (401), from the master node, a first parameter defining a size of the distributed computing cluster; receiving (403) from the Al model a second parameter defining a number of rows in the data matrix; and identifying (405) the set of redundancy factors based on the number of rows in the data matrix.

5. The method of any of Claims 1 to 4, further comprising: initializing (407) values of a plurality of parameters in the ML model to zero, the plurality of parameters comprising (i) a number of times that redundancy factors are selected from the set of redundancy factors, and (ii) an average reward value of the selected redundancy factors.

6. The method of Claim 5, further comprising: updating (417) the ML model with (i) the number of times that redundancy factors are selected, and (ii) the average reward value of the selected redundancy factors where the average reward value has an inverse relationship with the received overall execution time.

7. The method of any of Claims 1 to 6, wherein the selecting (413) comprises an online decision that selects the redundancy factor.

8. The method of Claim 7, wherein the selected redundancy factor is suitable for a mission critical operation.

9. The method of any of Claims 1 to 8, wherein the ML model comprises a multi-armed bandit model.

10. The method of any of Claims 1 to 9, wherein the plurality of worker devices comprises a plurality of Internet of Things, loT, edge computing devices.

11. A computing device (101, 1200), the computing device comprising: processing circuitry (1203); memory (1205) coupled with the processing circuitry, wherein the memory includes instructions that when executed by the processing circuitry causes the computing device to perform operations comprising: receive a request from the Al model for a redundancy factor for an iteration of training of the Al model, the redundancy factor comprising an amount of workload to be assigned per worker device of a distributed computing cluster in the iteration; select the redundancy factor in the iteration based on use of a machine learning, ML, model that selects the redundancy factor that has a lowest overall execution time from a set of redundancy factors; and send the selected redundancy factor to a master node for a distributed coded execution of a multiplication of a data matrix and a vector in an iteration of the training of the Al model.

12. The computing device of Claim 11, the operations further comprising any of the operations of Claims 2-10.

13. A computing device (101, 1200) adapted to perform operations comprising: receive a request from the Al model for a redundancy factor for an iteration of training of the Al model, the redundancy factor comprising an amount of workload to be assigned per worker device of a distributed computing cluster in the iteration; select the redundancy factor in the iteration based on use of a machine learning, ML, model that selects the redundancy factor that has a lowest overall execution time from a set of redundancy factors; and send the selected redundancy factor to a master node for a distributed coded execution of a multiplication of a data matrix and a vector in an iteration of the training of the Al model.

14. The computing device of Claim 13 adapted to perform operations further comprising any of the operations of Claims 2-10.

15. A computer program product comprising a non-transitory storage medium (1205) including program code to be executed by processing circuitry (1203) of a computing device (101, 1200), whereby execution of the program code causes the computing device to perform operations comprising: receive a request from the Al model for a redundancy factor for an iteration of training of the Al model, the redundancy factor comprising an amount of workload to be assigned per worker device of a distributed computing cluster in the iteration; select the redundancy factor in the iteration based on use of a machine learning, ML, model that selects the redundancy factor that has a lowest overall execution time from a set of redundancy factors; and send the selected redundancy factor to a master node for a distributed coded execution of a multiplication of a data matrix and a vector in an iteration of the training of the Al model.

16. The computer program product of Claim 15, the operations further comprising any of the operations of Claims 2-10.

17. A computer program comprising program code to be executed by processing circuitry (1203) of a computing device (101, 1200), whereby execution of the program code causes the computing device to perform operations comprising: receive a request from the Al model for a redundancy factor for an iteration of training of the Al model, the redundancy factor comprising an amount of workload to be assigned per worker device of a distributed computing cluster in the iteration; select the redundancy factor in the iteration based on use of a machine learning, ML, model that selects the redundancy factor that has a lowest overall execution time from a set of redundancy factors; and send the selected redundancy factor to a master node for a distributed coded execution of a multiplication of a data matrix and a vector in an iteration of the training of the Al model.

18. The computer program of Claim 17, whereby execution of the program code causes the computing device to perform operations according to any of Claims 2-10.

19. A computer-implemented method performed by a master node (207, 1300) in a distributed computing cluster for iterative training of a collaborative distributed coded artificial intelligence, Al, model, the method comprising: receiving (501), from a computing device, a redundancy factor, the redundancy factor comprising an amount of workload to be assigned per worker device of a distributed computing cluster in an iteration; receiving (503) a data matrix and a vector from the Al model; encoding (505) the data matrix into a plurality of submatrices; distributing (507) a respective submatrix and the vector to a respective worker device in the distributed computing cluster according to the redundancy factor; collecting (509) respective results of a distributed coded execution of a multiplication of a respective submatrix and the vector in an iteration of the training of the Al model by respective worker devices in the subset of the worker devices in the distributed computing cluster; extracting (511) an overall result from the collected respective results; and sending (513) the overall result to the Al model to determine whether the training is completed.

20. The method of Claim 19, further comprising: identifying (515) an overall execution time for the distributed coded execution of the multiplication of the respective submatrix and the vector in the iteration of the training of the Al model by the respective worker devices in the subset of worker devices in the distributed computing cluster.

21. The method of Claim 20, further comprising: sending (517) the overall execution time to the computing device.

22. A master node (207, 1300), the master node comprising: processing circuitry (1303); memory (1305) coupled with the processing circuitry, wherein the memory includes instructions that when executed by the processing circuitry causes the master node to perform operations comprising: receive, from a computing device, a redundancy factor, the redundancy factor comprising an amount of workload to be assigned per worker device of a distributed computing cluster in an iteration; receive a data matrix and a vector from the Al model; encode the data matrix into a plurality of submatrices; distribute a respective submatrix and the vector to a respective worker device in the distributed computing cluster according to the redundancy factor; collect respective results of a distributed coded execution of a multiplication of a respective submatrix and the vector in an iteration of the training of the Al model by respective worker devices in the subset of the worker devices in the distributed computing cluster; extract an overall result from the collected respective results; and send the overall result to the Al model to determine whether the training is completed.

23. The master node of Claim 22, the operations further comprising any of the operations of Claims 20-21.

24. A master node (207, 1300) adapted to perform operations comprising: receive a data matrix and a vector from the Al model; encode the data matrix into a plurality of submatrices; distribute a respective submatrix and the vector to a respective worker device in the distributed computing cluster according to the redundancy factor; collect respective results of a distributed coded execution of a multiplication of a respective submatrix and the vector in an iteration of the training of the Al model by respective worker devices in the subset of the worker devices in the distributed computing cluster; extract an overall result from the collected respective results; and send the overall result to the Al model to determine whether the training is completed.

25. The master node of Claim 24 adapted to perform operations further comprising any of the operations of Claims 20-21.

26. A computer program product comprising a non-transitory storage medium (1305) including program code to be executed by processing circuitry (1303) of a master node (207, 1300), whereby execution of the program code causes the master node to perform operations comprising: receive a data matrix and a vector from the Al model; encode the data matrix into a plurality of submatrices; distribute a respective submatrix and the vector to a respective worker device in the distributed computing cluster according to the redundancy factor; collect respective results of a distributed coded execution of a multiplication of a respective submatrix and the vector in an iteration of the training of the Al model by respective worker devices in the subset of the worker devices in the distributed computing cluster; extract an overall result from the collected respective results; and send the overall result to the Al model to determine whether the training is completed.

27. The computer program product of Claim 26, the operations further comprising any of the operations of Claims 20-21.

28. A computer program comprising program code to be executed by processing circuitry (1303) of a master node (207, 1300), whereby execution of the program code causes the master node to perform operations comprising: receive a data matrix and a vector from the Al model; encode the data matrix into a plurality of submatrices; distribute a respective submatrix and the vector to a respective worker device in the distributed computing cluster according to the redundancy factor; collect respective results of a distributed coded execution of a multiplication of a respective submatrix and the vector in an iteration of the training of the Al model by respective worker devices in the subset of the worker devices in the distributed computing cluster; extract an overall result from the collected respective results; and send the overall result to the Al model to determine whether the training is completed.

29. The computer program of Claim 28, whereby execution of the program code causes the master node to perform operations according to any of Claims 20-21.

30. A system for a computer-implemented method for iterative training of a collaborative distributed coded artificial intelligence, Al, model, the system comprising: a computing device (101, 1200) comprising a machine learning, ML, model configured to (i) select a redundancy factor from a set of redundancy factors per iteration of the iterative training of the collaborative distributed Al model, the redundancy factor comprising an amount of workload to be assigned per worker device of a distributed computing cluster in an iteration, and (ii) send the selected redundancy factor to a master node; a distributed computing cluster (207, 205) comprising (i) the master node (207) and (ii) a plurality of worker devices (205) that perform a distributed coded execution of a multiplication of a data matrix and a vector in an iteration of the training of the Al model based on a distribution by the master node of the data matrix and the vector based on the received selected redundancy factor; and the collaborative distributed Al model (105, 1309) communicatively connected to the distributed computing cluster and the computing device that is trained based on the distributed coded execution of the multiplication of the data matrix and the vector.