- Lorentz Center

Nonstochastic Multi-Armed Bandits With Graph-Structured Feedback
Noga Alon, TAU
Nicolo Cesa-Bianchi, Milan
Claudio Gentile, Insubria
Shie Mannor, Technion
Yishay Mansour, TAU and MSR
Ohad Shamir, Weizmann

Nonstochastic sequential decision-making
• K actions and T time steps
• $\ell_t(a)$ – loss of action a at time t
• At time t
  – player picks action $X_t$
  – incurs loss $\ell_t(X_t)$
  – observes feedback on losses
    • Multi-armed bandit (MAB): only $\ell_t(X_t)$
    • Experts (full information): $\ell_t(j)$ for any j

Nonstochastic sequential decision-making
• Goal:
  – minimize losses
  – benchmark: the best single action
    • the action j that minimizes the loss
  – no stochastic assumptions on losses
• Regret (player loss minus loss of the best action):
  $R_T = \mathbb{E}\left[\sum_{t=1}^{T} \ell_t(X_t)\right] - \min_j \sum_{t=1}^{T} \ell_t(j)$
• Known regret bounds:
  – MAB: $\sqrt{TK}$
  – Experts: $\sqrt{T \ln K}$

Motivation – observability
[Figure: two example observation graphs, one undirected and one directed]

undirected observation graph
[Figure sequence: an undirected graph whose node losses are initially unknown ("?"); after an action is played, its loss and the losses of its neighbors are revealed]
• MAB: no edges
• Experts: clique

Modeling
• Directed vs. undirected
  – different types of dependencies
• Informed vs. uninformed
  – when does the learner observe the graph?
  – before
  – after
    • only the neighbors
• Different measures
  – independent set
  – dominating set
  – max acyclic subgraph

Our Results
Uninformed setting
• Uninformed setting
  – only the neighbors of the node
• Undirected graph
  – independent sets
• Directed graph
  – max acyclic subgraph (not tight)
  – random Erdős-Rényi graphs
Informed setting
• Directed graphs
• Regret characterization
  – dominating sets and ind. sets
  – $\tilde{O}\left(\sqrt{T\,\alpha(G)\ln K}\right)$
• Both expectation and high prob.

EXP3-SET
• Online algorithm
  $\Pr[X_t = a] \propto \exp\left(-\eta \sum_{s=1}^{t-1} \hat\ell_s(a)\right)$
  where $\hat\ell_t(a) = \frac{\ell_t(a)}{\Pr[\ell_t(a)\text{ is observed}]}$ if $\ell_t(a)$ is observed, and $\hat\ell_t(a) = 0$ otherwise
  – unbiased: $\mathbb{E}[\hat\ell_t(a)] = \ell_t(a)$
• Theorem
  $R_T \le \frac{\ln K}{\eta} + \frac{\eta}{2} \sum_{t=1}^{T} \sum_{a} \Pr[X_t = a \mid \ell_t(a)\text{ is observed}]$
  – the inner sum is at most $\alpha(G_t)$
  – tuning $\eta$: $R_T = O\left(\sqrt{T\,\alpha(G)\ln K}\right)$ for a fixed graph G, and $O\left(\sqrt{\ln K \sum_{t=1}^{T} \alpha(G_t)}\right)$ for changing graphs $G_t$

EXP3-SET Regret – key lemma
• Lemma
  $Q = \sum_{a} \frac{\Pr[X_t = a]}{\Pr[a\text{ observed}]} \le \alpha(G)$
• Note: MAB: Q = K; full info.: Q = 1
• Proof: build an i.s. S
  – consider the action a with minimal Pr[a observed]
  – add a to S
  – delete a and its neighbors
• Note
  $\sum_{j \in N(a)} \frac{\Pr[X_t = j]}{\Pr[j\text{ observed}]} \le \sum_{j \in N(a)} \frac{\Pr[X_t = j]}{\Pr[a\text{ observed}]} = 1$

Dominating set – directed graph
[Figure sequence: a directed observation graph with unknown losses ("?") and a dominating set]

EXP3-DOM
• Simplified version
  – fixed graph G
  – D is a dominating set
    • log approx
• Main modification
  – add probabilities to D
    • induce observability
• probabilities:
  $p_{a,t} = (1-\gamma)\frac{w_{a,t}}{W_t} + \frac{\gamma}{|D|}\,\mathbb{I}[a \in D]$
• Select $X_t$ using $p_t$
• Observe $\ell_t(a)$ for a in $S_{X_t,t}$
• weights
  $w_{a,t+1} = w_{a,t}\exp\left(-\gamma\,\hat\ell_t(a)/|D|\right)$
  $\hat\ell_t(a) = \frac{\ell_t(a)}{\Pr[\text{observe } a]}\,\mathbb{I}[a \in S_{X_t,t}]$

EXP3-DOM
• Simple example
• Transitive observability
  – tournament
    • action 1 observes all actions
  – D = {1}
• EXP3-DOM
  • Sample action 1 with prob. γ
    – action 1 is the exploration
  • Otherwise run a MAB
    – specifically EXP3-SET
• Intuition
  – action 1 replaces the mixture with uniform

Conclusion
• Observability model
  – between MAB and Experts
    • more work to be done
• Uninformed setting
  – undirected graph
• Informed setting
  – directed graph
• [Kocák, Neu, Valko and R. Munos] improved the uninformed setting

Outline
• Model and motivation
• symmetric observability
• non-symmetric observability

EXP3-DOM: key lemma
• Lemma
  – G directed graph,
  – $d_i^-$ indegree of i,
  – α = α(G)
  $\sum_{i=1}^{K} \frac{1}{1+d_i^-} \le 2\alpha \ln\left(1+\frac{K}{\alpha}\right)$
• Proof: high level
  – shrink graph
    • $G_K, G_{K-1}, \dots$
  – delete nodes
    • step s:
      – delete the max indegree node
    • from Turán's theorem:
      $\max_i d_{i,s}^- \ge \frac{|D_s|}{|V_s|} \ge \frac{|V_s|}{2\alpha_s} - \frac{1}{2}$
      ($D_s$: arcs of $G_s$; $\alpha_s = \alpha(G_s)$)
• Turán's Theorem
  – undirected graph G(V,E)
  $\frac{|V|}{\frac{2|E|}{|V|}+1} \le \alpha(G)$

EXP3-DOM: key lemma (proof)
• Completing the proof ($i_K$ is the max indegree node of $G_K$):
  $\sum_{i=1}^{K} \frac{1}{1+d_{i,K}^-} = \frac{1}{1+d_{i_K,K}^-} + \sum_{i \ne i_K} \frac{1}{1+d_{i,K}^-} \le \frac{2\alpha}{\alpha+K} + \sum_{i \ne i_K} \frac{1}{1+d_{i,K-1}^-} \le \dots \le \sum_{s=1}^{K} \frac{2\alpha}{\alpha+s} \le 2\alpha \ln\left(1+\frac{K}{\alpha}\right)$
• Note, due to edge elimination:
  $d_{i,K-1}^- \le d_{i,K}^-$

EXP3-DOM – Key lemma (modified)
• Lemma (what we really need!)
• G(V,E) directed graph
  – $IN_i$ in-neighbors of i
  – r size of a dominating set, and α size of an ind. set
  – p distribution over V
    • $p_i \ge \beta$
  $Q = \sum_{i=1}^{K} \frac{p_i}{p_i + \sum_{j \in IN_i} p_j} \le 2\alpha \ln\left(1 + \frac{\lceil K^2/(r\beta) \rceil + K}{\alpha}\right) + 2r$

EXP3-DOM: changing graphs
• Simple
  – all dom. sets same size
  – approx. same size
• Problem
  – different size dom. sets
    • can be 1 or K
• Solution
  – keep log levels
    • depend on $\log_2 |D_t|$
  – algorithm per level
• Complications
  – parameters depend on level
  – setting the learning rate
    • need a delicate doubling
• Main tech. challenge
  – handle a dynamic adversary

EXP3-DOM
• receive obs. graph
  – find dominating set $D_t$
    • logarithmic approximation
• Run the right copy
  – let $b_t = \lfloor \log_2 |D_t| \rfloor$
  – run copy $b_t$
    • log copies
• For copy $b_t$
  – parameters depend on $b_t$
• probabilities:
  $p_{a,t} = (1-\gamma_{b_t})\frac{w_{a,t}}{W_t} + \frac{\gamma_{b_t}}{|D_t|}\,\mathbb{I}[a \in D_t]$
• Select $X_t$ using p
• Observe $\ell_t(a)$ for a in $S_{X_t,t}$
• weights
  $w_{a,t+1} = w_{a,t}\exp\left(-\gamma_{b_t}\,\hat\ell_t(a)/2^{b_t}\right)$
  $\hat\ell_t(a) = \frac{\ell_t(a)}{\Pr[\text{observe } a]}\,\mathbb{I}[a \in S_{X_t,t}]$

EXP3-DOM – main Theorem
• Theorem:
  $R_T \le \sum_{b=0}^{\lfloor \log_2 K \rfloor} \left( \frac{2^b \ln K}{\gamma_b} + \gamma_b\, \mathbb{E}\left[ \sum_{t:\, b_t = b} \left( 1 + \frac{Q_t^{b}}{2^{b+1}} \right) \right] \right)$
• tuning $\gamma_b$:
  $R_T = O\left( (\ln K)\, \mathbb{E}\left[ \sqrt{\sum_{t=1}^{T} \left( 4|D_t| + Q_t^{b_t} \right)} \right] + (\ln K)\ln(KT) \right)$

Independent set
• Independent set α(G)
• [Mannor & Shamir 2012]
• Tight regret: $\sqrt{T\,\alpha(G)\ln K}$
  – α(G) "replaces" K
• Cons:
  – requires observing G
  – solves an LP at each step
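As a rough illustration of the EXP3-SET update and its key lemma from the slides, the following minimal Python sketch runs exponential weights over importance-weighted loss estimates on a fixed undirected observation graph. The function names (`observation_probs`, `greedy_independent_set`, `exp3_set`), the representation of the graph as a list of closed neighborhoods (each set contains the node itself), and the fixed learning rate `eta` are assumptions made for this example, not part of the slides.

```python
import math
import random


def observation_probs(p, neighbors):
    """q[a] = Pr[loss of a is observed]: total probability of a's closed neighborhood."""
    return [sum(p[j] for j in neighbors[a]) for a in range(len(p))]


def greedy_independent_set(p, neighbors):
    """Independent set built as in the key-lemma proof sketch (hypothetical helper)."""
    q = observation_probs(p, neighbors)
    remaining = set(range(len(p)))
    chosen = []
    while remaining:
        a = min(remaining, key=lambda i: q[i])  # action with minimal Pr[a observed]
        chosen.append(a)
        remaining -= neighbors[a]  # delete a and its neighbors
    return chosen


def exp3_set(losses, neighbors, eta, seed=0):
    """Run the sketch on a T x K loss table; returns (total loss, list of Q_t)."""
    rng = random.Random(seed)
    K = len(neighbors)
    cum_est = [0.0] * K  # cumulative estimated losses per action
    total, q_values = 0.0, []
    for loss_t in losses:
        m = min(cum_est)  # shift the exponent for numerical stability
        w = [math.exp(-eta * (c - m)) for c in cum_est]
        W = sum(w)
        p = [x / W for x in w]
        q = observation_probs(p, neighbors)
        # Q_t from the key lemma; terms with q[a] == 0 (underflow) have p[a] == 0 too
        q_values.append(sum(p[a] / q[a] for a in range(K) if q[a] > 0))
        x = rng.choices(range(K), weights=p)[0]
        total += loss_t[x]
        # undirected graph: playing x reveals the losses in its closed neighborhood
        for a in neighbors[x]:
            cum_est[a] += loss_t[a] / q[a]
    return total, q_values
```

With `neighbors = [{a} for a in range(K)]` (no edges) the sketch reduces to a bandit-style update with Q = K, and with every neighborhood equal to the full action set (a clique) it reduces to the experts setting with Q = 1, matching the note on the key-lemma slide.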
