Reinforcement Learning for Resource Allocation and Time Series Tools for Mobility Prediction
Baptiste Lefebvre 1,2, Stephane Senecal 2 and Jean-Marc Kelif 2
1 École Normale Supérieure (ENS), Paris, France, baptiste.lefebvre@ens.fr
2 Orange Labs, Issy-les-Moulineaux, France, stephane.senecal@orange.com, jeanmarc.kelif@orange.com
First GdR MaDICS Workshop on Big Data for the 5G RAN
25 November 2015 @ Huawei FRC
Agenda
1 Context
2 Current Controller
3 Proposed Controller
4 Mobility Prediction
5 Conclusion
Wireless Networks
UE = User Equipment
BS = Base Station
Radio Resource Management (RRM)
PRB (1) = 12 subcarriers (180 kHz) × 1 slot (0.5 ms); PRE (2)
Allocation: sharing of joint timeslots and frequency bands
Load: ρ = (T / (r R)) Σ_{c=1}^{C} n_c / D_c, with ρ̄ = min(ρ, 1)
Quality of Service (QoS): per-user throughput compared to a target throughput T
Energy/Power Consumption: P = P_BS + r (P_RS + ρ̄ P_AP)
1. Physical Resource Block
2. Physical Resource Element
Goal: optimization of the energy consumption under QoS constraints
Formal framework considered: reinforcement learning [SB98]
More specifically, Markov Decision Processes (MDP) [Put94]:
- A system state counts the UEs in each radio condition and the active resources
- An action is either null, a deactivation or an activation of a resource
- A policy associates an action to every state
In order to perform energy savings, one needs to compute or estimate an optimal policy, i.e. a policy which implements a good trade-off between energy (electrical power) consumption and the targeted QoS level
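To make these objects concrete, here is a minimal sketch of the state space, the actions and a policy in Python (the sizes C, N, R and the constraint of keeping at least one resource active are illustrative assumptions, not values from the slides):

```python
from itertools import product

C, N, R = 2, 3, 4  # illustrative sizes: radio conditions, max UEs per class, resources

# A state is (n, r): n counts the UEs in each radio condition, r the active resources.
states = [(n, r) for n in product(range(N + 1), repeat=C) for r in range(1, R + 1)]

# An action is null (0), a deactivation (-1) or an activation (+1) of one resource.
def actions(state):
    _, r = state
    acts = [0]
    if r > 1:
        acts.append(-1)   # deactivate one resource (at least one stays active)
    if r < R:
        acts.append(+1)   # activate one resource
    return acts

# A policy maps every state to an action, e.g. a trivial "always null" policy.
policy = {s: 0 for s in states}
```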
MDP Controller
A controller executes a policy Π which, for a given traffic volume, aims at maximizing an objective function (QoS, power)
Transition probability operator P(s, a, s')
Instantaneous reward function R(s, a)
Searching for an optimal policy of a fully known MDP model can be performed by dynamic programming
Controller for Geometric Criterion
max_Π E[ Σ_{t=0}^{+∞} φ^t R(s_t, Π(s_t)) | s_0 = s ]
Solving a system of equations by iterating until reaching a fixed point (geometric criterion):
Π(s) = argmax_{a ∈ A} Σ_{s' ∈ S} P(s, a, s') ( R(s, a) + φ V(s') )
V(s) = Σ_{s' ∈ S} P(s, Π(s), s') ( R(s, Π(s)) + φ V(s') )
Parameter φ ∈ [0; 1[
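As an illustration, a minimal value-iteration sketch of this fixed-point computation for a generic finite MDP (the arrays P and R, the tolerance and the loop structure are placeholders, not the model of the slides):

```python
import numpy as np

def value_iteration(P, R, phi, tol=1e-9):
    """P[a][s, s'] transition probabilities, R[a][s] expected rewards, phi in [0, 1)."""
    n_actions, n_states = len(P), P[0].shape[0]
    V = np.zeros(n_states)
    while True:
        # Q(s, a) = sum_s' P(s, a, s') (R(s, a) + phi V(s'))
        Q = np.array([R[a] + phi * P[a] @ V for a in range(n_actions)])
        V_new = Q.max(axis=0)
        if np.max(np.abs(V_new - V)) < tol:
            return Q.argmax(axis=0), V_new   # greedy policy Pi and value function V
        V = V_new
```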
Controller for Average Criterion
max_Π lim_{T→+∞} E[ (1/T) Σ_{t=0}^{T} R(s_t, Π(s_t)) | s_0 = s ]
Solving a system of equations (1) by iterating until reaching a fixed point (average criterion):
Π(s) = argmax_{a ∈ A} Σ_{s' ∈ S} P(s, a, s') ( R(s, a) + V(s') )
V(s) = Σ_{s' ∈ S} P(s, Π(s), s') ( R(s, Π(s)) + V(s') )
1. Dynamic programming valid if V < +∞
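For the average criterion, a common choice (assumed here, not stated on the slide) is relative value iteration, which subtracts the value of a reference state at every sweep so that V stays bounded:

```python
import numpy as np

def relative_value_iteration(P, R, ref=0, tol=1e-9, max_iter=100000):
    """Average-reward criterion: iterate V <- max_a [R + P V] - V[ref]."""
    n_actions, n_states = len(P), P[0].shape[0]
    V = np.zeros(n_states)
    for _ in range(max_iter):
        Q = np.array([R[a] + P[a] @ V for a in range(n_actions)])
        V_new = Q.max(axis=0)
        gain = V_new[ref]          # estimate of the optimal average reward
        V_new = V_new - gain       # keep V bounded (cf. the slide's footnote)
        if np.max(np.abs(V_new - V)) < tol:
            break
        V = V_new
    return Q.argmax(axis=0), V_new, gain
```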
States, Transitions and Rewards
The system evolves in continuous time and not in discrete time
It is possible to turn a continuous-time MDP into a discrete-time MDP via uniformization and discretization schemes
P(s, a, s') is replaced by Q(s, a, s'), which denotes a transition rate (i.e. a Poisson process parameter)
R(s, a) is replaced by C(s, a), which denotes a cost per time unit
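A sketch of this continuous-to-discrete conversion by uniformization (one common convention; note that the cost slide later divides the cost rate by the total outgoing rate of the state rather than by a global constant Λ):

```python
import numpy as np

def uniformize(Q, cost_rate):
    """Q[a][s, s'] transition rates for s != s' (diagonal entries are ignored),
    cost_rate[a][s] costs per time unit."""
    n_actions = len(Q)
    rates = [Q[a] - np.diag(np.diag(Q[a])) for a in range(n_actions)]  # drop diagonal
    out = np.array([r.sum(axis=1) for r in rates])     # total outgoing rate per state
    Lam = out.max()                                    # uniformization constant Lambda
    P, C = [], []
    for a in range(n_actions):
        Pa = rates[a] / Lam
        np.fill_diagonal(Pa, 1.0 - out[a] / Lam)       # fictitious self-loops
        P.append(Pa)
        C.append(np.asarray(cost_rate[a]) / Lam)       # cost accrued over one step of mean length 1/Lambda
    return P, C, Lam
```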
States Transitions Modeling
Q((n, r), a, (n', r')) =
  λ_i                         if B_{λ_i}(s, a, s')
  r R n_i D_i / (n F_i)       if B_{µ_i}(s, a, s')
  0                           otherwise
B_{λ_i}(s, a, s') = ( n' = n + e^{(i)} ) ∧ ( r' = r + a )
B_{µ_i}(s, a, s') = ( n' = n − e^{(i)} ) ∧ ( r' = r + a )
e^{(i)} = C-tuple composed of 0s except the i-th element, which equals 1
Data traffic on an LTE network with Round Robin scheduling
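A sketch of this transition-rate function in Python, under the reading that the class-i departure rate is r R n_i D_i / (n F_i) with n the total number of UEs (parameter names are assumptions; the action is assumed feasible):

```python
def transition_rates(state, action, lam, D, F, R):
    """Yield (next_state, rate) pairs for state (n, r) under the given action.

    n: tuple of UE counts per radio condition, r: active resources,
    lam[i]: arrival rate, D[i]: spectral efficiency, F[i]: mean flow size.
    """
    n, r = state
    r_next = r + action
    total = sum(n)
    for i in range(len(n)):
        # Arrival of a class-i UE (rate lambda_i).
        n_up = n[:i] + (n[i] + 1,) + n[i + 1:]
        yield (n_up, r_next), lam[i]
        # Departure of a class-i UE: Round Robin gives each UE throughput r R D_i / total,
        # so class i completes flows at rate r R n_i D_i / (total F_i).
        if n[i] > 0:
            n_down = n[:i] + (n[i] - 1,) + n[i + 1:]
            yield (n_down, r_next), r * R * n[i] * D[i] / (total * F[i])
```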
Rewards - Costs Functions
C(s, a) = ( γ E(n, r + a) + (1 − γ) F(n, r + a) ) / Σ_{s' ∈ S} Q(s, a, s')
E(n, r) = (P_BS + r P_RS) / (P_BS + R (P_RS + P_AP))                   if n = 0
E(n, r) = (P_BS + r (P_RS + P_AP)) / (P_BS + R (P_RS + P_AP))          otherwise
F(n, r) = 0                                                            if n = 0
F(n, r) = 1 − exp( − log(2) T Σ_{i=1}^{C} (n_i / D_i) / (r R) )        otherwise
Multiobjective optimization via scalarization
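A sketch of these cost terms in Python; the exponent of F is written so that F(n, α(n)) = 1/2, which is my reading consistent with the later footnote on the ideal number of resources α(n):

```python
import math

def energy_cost(n, r, R, P_BS, P_RS, P_AP):
    """Normalized power E(n, r) in [0, 1]."""
    num = P_BS + r * P_RS if sum(n) == 0 else P_BS + r * (P_RS + P_AP)
    return num / (P_BS + R * (P_RS + P_AP))

def qos_cost(n, r, R, T, D):
    """QoS dissatisfaction F(n, r); equals 1/2 when r R = T * sum_i n_i / D_i."""
    if sum(n) == 0:
        return 0.0
    demand = T * sum(n_i / D_i for n_i, D_i in zip(n, D))
    return 1.0 - math.exp(-math.log(2.0) * demand / (r * R))

def transition_cost(n, r, a, gamma, out_rate, R, T, D, P_BS, P_RS, P_AP):
    """C(s, a): scalarized cost rate divided by the total outgoing rate of the state."""
    e = energy_cost(n, r + a, R, P_BS, P_RS, P_AP)
    f = qos_cost(n, r + a, R, T, D)
    return (gamma * e + (1.0 - gamma) * f) / out_rate
```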
Current Results
The optimal policy is a threshold policy
The optimal policy depends on the traffic volume, on the target throughput and on the cell capacity
Executing the optimal policy enables energy savings of the order of 40 %
Proposal to take the activation time into account by adding a timer
Adaptation to traffic evolution through an ε-greedy strategy
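A minimal ε-greedy wrapper of the kind alluded to here (purely illustrative; the exploration schedule is not specified on the slide):

```python
import random

def epsilon_greedy(state, greedy_policy, feasible_actions, epsilon=0.1):
    """With probability epsilon try a random feasible action, otherwise follow the policy."""
    if random.random() < epsilon:
        return random.choice(feasible_actions(state))
    return greedy_policy[state]
```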
Optimization under Congestion
The controller does not activate all of its resources in order to resolve the congestion as quickly as possible
Unused Resources
The controller cannot activate all of its resources
Excessive QoS
The controller can grant an effective QoS level much greater than the initially targeted QoS level (e.g. 5 000 kbps vs. 400 kbps)
States Transitions Modeling
Q(s, a, s') =
  λ_i                              if B_{λ_i}(s, a, s')
  (r + a) R n_i D_i / (n F_i)      if B_{µ_i}(s, a, s') ∧ ¬B(r, a, r')
  r R n_i D_i / (n F_i)            if B_{µ_i}(s, a, s') ∧ B(r, a, r')
  0                                otherwise
B_{λ_i}((n, r), a, (n', r')) = ( n' = n + e^{(i)} ∨ (n' = n ∧ n = N) ) ∧ ( r' = r + a ∨ B(r, a, r') )
B_{µ_i}((n, r), a, (n', r')) = ( n' = n − e^{(i)} ) ∧ ( r' = r + a ∨ B(r, a, r') )
B(r, a, r') = ( r = r' = 1 ∧ a = −1 ) ∨ ( r = r' = R ∧ a = +1 )
Difference in the temporality of the action execution
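A small sketch of the blocking predicate B and of the resource count that serves traffic until the next transition in this variant (the placement of the negation is an assumption on my part):

```python
def blocked(r, a, R):
    """B(r, a): the action cannot be executed at the boundary of the resource range."""
    return (r == 1 and a == -1) or (r == R and a == +1)

def serving_resources(r, a, R):
    """Resources serving traffic until the next transition: r + a if the action is
    executed immediately, r if it is blocked."""
    return r if blocked(r, a, R) else r + a
```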
Ideal and Effective Power Consumption
Ideal power consumption:
P*(n) = P_BS + P_RS                         if α(n) = 0
P*(n) = P_BS + α(n) P_RS + α(n) P_AP        otherwise
Ideal number of resources (2):
α(n) = min( Σ_{i=1}^{C} n_i T / (R D_i), R )
Effective power consumption:
P̂(n, r) = P_BS + r P_RS                    if n = 0
P̂(n, r) = P_BS + r P_RS + r P_AP           otherwise
2. Obtained by solving the equation F(n, r) = 1/2 = β
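A sketch of these quantities in Python (β is fixed at 1/2 through α(n), as in the footnote; parameter names are assumptions):

```python
def ideal_resources(n, T, R, D):
    """alpha(n) = min(sum_i n_i T / (R D_i), R): resources needed to serve every UE at rate T."""
    return min(sum(n_i * T / (R * D_i) for n_i, D_i in zip(n, D)), R)

def ideal_power(n, T, R, D, P_BS, P_RS, P_AP):
    """P*(n): power consumed with the ideal (possibly fractional) number of resources."""
    alpha = ideal_resources(n, T, R, D)
    if alpha == 0:
        return P_BS + P_RS            # at least one resource always stays active
    return P_BS + alpha * P_RS + alpha * P_AP

def effective_power(n, r, P_BS, P_RS, P_AP):
    """P_hat(n, r): power consumed with r resources actually active."""
    if sum(n) == 0:
        return P_BS + r * P_RS
    return P_BS + r * P_RS + r * P_AP
```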
Power Consumption Error Modeling
Normalized regret:
E((n, r), a) = ( P̂(n, r) − P*(n) ) / ( R (P_RS + P_AP) )          if B(r, a)
E((n, r), a) = ( P̂(n, r + a) − P*(n) ) / ( R (P_RS + P_AP) )      otherwise
B(r, a) = ( r = 1 ∧ a = −1 ) ∨ ( r = R ∧ a = +1 )
Rewards - Costs Functions
Symmetrical instantaneous reward: R(s, a) = −|E(s, a)|
Asymmetrical instantaneous reward: R_θ(s, a) = E(s, a) · 1_{E(s,a) < 0} − θ E(s, a) · 1_{E(s,a) ≥ 0}
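A sketch of the regret and of the two reward shapes (the signs, which make both rewards non-positive penalties, are my reading of the slide):

```python
def normalized_regret(p_effective, p_ideal, R, P_RS, P_AP):
    """E((n, r), a): effective minus ideal power, normalized by R (P_RS + P_AP).
    p_effective is evaluated at r if the action is blocked (B(r, a)), at r + a otherwise."""
    return (p_effective - p_ideal) / (R * (P_RS + P_AP))

def symmetric_reward(e):
    """R(s, a) = -|E(s, a)|: over- and under-provisioning penalized alike."""
    return -abs(e)

def asymmetric_reward(e, theta):
    """R_theta(s, a): under-provisioning (e < 0) kept as is, over-provisioning weighted by theta."""
    return e if e < 0 else -theta * e
```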
Results
Overall Performance
Table of the quantiles q̂_0.1, q̂_0.5, q̂_0.99 of the performance metric, for the current controller (parameter γ) and the proposed controller (parameter θ), with β = 1/2, 3/4 and 9/10
Overall Performance
Plots for β = 3/4 and β = 9/10
Mobility
Traffic due to arrivals and departures of UEs in the coverage zone of the BS, modeled by Poisson processes
Moves of UEs inducing propagation losses, shadowing and fast fading
Problem Statement
The activation/deactivation timeframe of a physical resource is not taken into account in the model
Idea: predict the states to be visited in the next few seconds
This approach makes it possible to take mobile users into account
Given the SINR traces of users who previously crossed the cell and the SINR trace of a user currently crossing the cell, we aim at estimating the SINR to be measured in the near future
Problem Modeling
Let T = {T_1, …, T_K} denote a set of (reference) time series:
T_1 = t_{1,1}, …, t_{1,N_1}
…
T_K = t_{K,1}, …, t_{K,N_K}
Let T = t_1, …, t_N denote a time series to be completed
Three levels of predictors:
t̂_N = f(T)
t̂_N = g(T, T_k), with T_k ∈ T
t̂_N = h(T, T_k), with T_k ∈ D, where D = {D_1, …, D_M} is a clustering of T
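A minimal sketch of the first two predictor levels, using a plain Euclidean distance on the observed prefix as a stand-in for the DTW matching introduced on the next slide (names and the one-step horizon are illustrative):

```python
import numpy as np

def predict_self(trace):
    """f(T): naive completion from the current trace alone (persistence forecast)."""
    return trace[-1]

def predict_from_references(trace, references):
    """g(T, T_k): find the reference trace whose prefix best matches the current one
    and return the value it took one step later."""
    n = len(trace)
    best, best_dist = None, np.inf
    for ref in references:
        if len(ref) <= n:
            continue                            # need at least one future sample
        dist = np.linalg.norm(np.asarray(ref[:n]) - np.asarray(trace))
        if dist < best_dist:
            best, best_dist = ref, dist
    return best[n] if best is not None else predict_self(trace)
```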
Dynamic Time Warping (DTW)
Let T = t_1, …, t_N denote a time series
Let T' = t'_1, …, t'_{N'} denote another time series
Let d denote a distance measure between elements of these time series
D(t_i, t'_j) = d(t_i, t'_j) + min( D(t_{i−1}, t'_j), D(t_i, t'_{j−1}), D(t_{i−1}, t'_{j−1}) )
DTW(T, T') = D(t_N, t'_{N'})
Computation via dynamic programming in O(N²), cf. [SC78]
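A direct dynamic-programming implementation of this recurrence (a plain sketch, without any window constraint):

```python
import numpy as np

def dtw(T1, T2, d=lambda x, y: abs(x - y)):
    """Dynamic Time Warping distance between two sequences, cf. [SC78]."""
    N1, N2 = len(T1), len(T2)
    D = np.full((N1 + 1, N2 + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, N1 + 1):
        for j in range(1, N2 + 1):
            # D(i, j) = d(t_i, t'_j) + min(D(i-1, j), D(i, j-1), D(i-1, j-1))
            D[i, j] = d(T1[i - 1], T2[j - 1]) + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[N1, N2]
```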
Barycentric Averaging DTW
Let T = {T_1, …, T_K} denote a set of time series:
T_1 = t_{1,1}, …, t_{1,N_1}
…
T_K = t_{K,1}, …, t_{K,N_K}
The barycentric averaging DTW T̄ satisfies (cf. [PKG11]):
∀ N' ∈ ℕ, ∀ T' = t'_1, …, t'_{N'} :  Σ_{k=1}^{K} ( DTW(T̄, T_k) )² ≤ Σ_{k=1}^{K} ( DTW(T', T_k) )²
Computation via an iterative scheme in Θ(I K N²), where I ≪ N
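A compact sketch of the iterative (DBA) scheme of [PKG11]: each iteration aligns every series to the current average with DTW and replaces each point of the average by the mean of the points aligned to it (initialization with the first series and the iteration count are arbitrary choices here):

```python
import numpy as np

def dtw_path(x, y):
    """Optimal DTW alignment path between 1-D sequences x and y."""
    n, m = len(x), len(y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i, j] = abs(x[i - 1] - y[j - 1]) + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    # Backtrack from (n, m) to (1, 1).
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = int(np.argmin([D[i - 1, j - 1], D[i - 1, j], D[i, j - 1]]))
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]

def dba(series, n_iter=10):
    """DTW Barycenter Averaging: iteratively refine an average sequence."""
    avg = np.array(series[0], dtype=float)      # initialize with the first series
    for _ in range(n_iter):
        buckets = [[] for _ in range(len(avg))]
        for s in series:
            for i, j in dtw_path(avg, s):
                buckets[i].append(s[j])         # points of s aligned to avg[i]
        avg = np.array([np.mean(b) for b in buckets])
    return avg
```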
Fast Dynamic Time Warping (FastDTW)
Multi-level approach for the computation of the dynamic time warping, cf. [SC04]
Linear space complexity
Linear time complexity
Approximation method with good precision (controlled by a tuning parameter, the radius r)
Computation in Θ(I K r N)
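For experimentation, one possible call to an off-the-shelf implementation, assuming the third-party fastdtw Python package (the package and its fastdtw(x, y, radius, dist) signature are an assumption here, not something from the slides):

```python
# pip install fastdtw   (third-party package, assumed available)
import numpy as np
from fastdtw import fastdtw

x = np.random.randn(1000)          # e.g. an SINR trace
y = np.random.randn(1200)          # another trace, possibly of a different length

# The radius trades accuracy for speed: the cost grows roughly linearly with the series length.
distance, path = fastdtw(x, y, radius=10, dist=lambda a, b: abs(a - b))
print(distance, len(path))
```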
Preliminary Results
SINR estimations achieved with a precision of the order of a dB for time horizons of the order of 1 s
Conclusion
Summary:
- Review of state-of-the-art controllers
- Proposal of a modified and improved controller
- Proposal of a mobility prediction mechanism (different from those proposed for inter-cell transfer (handover) management)
Work in progress / perspectives:
- Integration of the mobility prediction module into the controller
- Enhancement of the mobility prediction mechanism
- Design of a higher-level control system for many cells, or even for an entire network
References
[PKG11] François Petitjean, Alain Ketterlin, and Pierre Gançarski. A global averaging method for dynamic time warping, with applications to clustering. Pattern Recognition, 44(3):678-693, 2011.
[Put94] Martin Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley-Interscience, 1994.
[SB98] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, Cambridge, 1998.
[SC78] Hiroaki Sakoe and Seibi Chiba. Dynamic programming algorithm optimization for spoken word recognition. IEEE Transactions on Acoustics, Speech, and Signal Processing, 26(1):43-49, 1978.
[SC04] Stan Salvador and Philip Chan. FastDTW: Toward accurate dynamic time warping in linear time and space. In KDD Workshop on Mining Temporal and Sequential Data. ACM, 2004.
Thank you for your attention! Questions?
This research work is funded by Orange and supported by the collaborative research project ANR NETLEARN (ANR-13-INFR-4)
Appendix: example of an MDP-based controller
Illustration of the state space as a grid of states (n, r), with example transitions of the controller shown on this grid