Nonstochastic Multi-Armed Bandits
With Graph-Structured Feedback
Noga Alon, TAU
Nicolo Cesa-Bianchi, Milan
Claudio Gentile, Insubria
Shie Mannor, Technion
Yishay Mansour, TAU and MSR
Ohad Shamir, Weizmann
Nonstochastic sequential
decision-making
• K actions and T time steps
• $\ell_t(a)$ – loss of action a at time t
• At time t
– player picks action $X_t$
– incurs loss $\ell_t(X_t)$
– observes feedback on losses
• Multi-armed bandit: only $\ell_t(X_t)$
• Experts (full information): $\ell_t(j)$ for any j
Nonstochastic sequential
decision-making
• Goal:
– minimize losses
– benchmark: The best
single action
• The action j that
minimizes the loss
– no stochastic
assumptions on losses
• Regret
$R_T = \mathbb{E}\Big[\sum_{t=1}^{T} \ell_t(X_t)\Big] - \min_j \sum_{t=1}^{T} \ell_t(j)$
(player loss minus the loss of the best action)
• Known regret bounds:
– MAB: $\sqrt{TK}$
– Experts: $\sqrt{T \ln K}$
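The regret definition can be made concrete with a small sketch. It computes the realized regret of one run (the slide's definition takes an expectation over the player's randomness). The function name `regret` and the toy loss table are illustrative assumptions, not part of the talk.

```python
def regret(losses, actions):
    """Realized regret of one run.

    losses[t][a] is the loss of action a at step t (hypothetical layout),
    actions[t] is the action the player picked at step t.
    """
    player_loss = sum(losses[t][a] for t, a in enumerate(actions))
    num_actions = len(losses[0])
    # Benchmark: the single action with the smallest total loss in hindsight.
    best_fixed = min(sum(row[j] for row in losses) for j in range(num_actions))
    return player_loss - best_fixed


# Toy instance with K = 2 actions and T = 3 steps.
losses = [[0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
print(regret(losses, [1, 0, 1]))  # player always picks the worse action
```

The benchmark is a fixed action, so a player who switches actions cleverly can have negative realized regret on a particular run.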
Motivation – observability
[Figure: two example observation graphs, one undirected and one directed]
undirected observation graph
[Figure: an undirected graph over the actions. Losses start out unknown ("?"); playing an action reveals its own loss and the losses of its neighbors.]
• MAB: no edges
• Experts: clique
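A minimal sketch of the feedback model, assuming the graph is stored as a dictionary of neighbor lists and that every action always observes its own loss. The names `observed_losses`, `bandit`, `experts` and `path` are hypothetical.

```python
def observed_losses(neighbors, loss_row, played):
    """Losses revealed when `played` is chosen: its own and its neighbors'."""
    revealed = {played} | set(neighbors[played])
    return {a: loss_row[a] for a in sorted(revealed)}


K = 4
bandit = {a: [] for a in range(K)}                                 # MAB: no edges
experts = {a: [b for b in range(K) if b != a] for a in range(K)}   # clique
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}                      # in between

loss_row = [0.3, 0.7, 0.1, 0.5]
for name, graph in [("bandit", bandit), ("experts", experts), ("path", path)]:
    print(name, observed_losses(graph, loss_row, played=1))
```

The two extreme graphs recover the two classical settings; any other graph gives an intermediate amount of feedback.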
Modeling
Directed vs Undirected
• Different types of dependencies
• Different measures
– Independent set
– Dominating set
– Max Acyclic Subgraph
Informed vs Uninformed
• When does the learner observe the graph?
– Before
– After
• only the neighbors
Our Results
Uninformed setting
• Undirected graph
– Only the neighbors of the node
– Independent sets: $\tilde{O}\big(\sqrt{T\,\alpha(G)\ln K}\big)$
• Directed graph
– Max Acyclic Subgraph (not tight)
– Random Erdos-Renyi graphs
Informed setting
• Directed graphs
• Regret characterization
– dominating sets and ind. set
• Both expectation and high prob.
EXP3-SET
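• Online Algorithm
$\Pr[X_t = a] \propto \exp\Big(-\eta \sum_{s=1}^{t-1} \hat\ell_s(a)\Big)$
where
$\hat\ell_t(a) = \ell_t(a) / \Pr[\ell_t(a) \text{ is observed}]$ if $\ell_t(a)$ is observed, and $0$ otherwise
$\mathbb{E}[\hat\ell_t(a)] = \ell_t(a)$
• Theorem
$R_T \le \frac{\ln K}{\eta} + \frac{\eta}{2} \sum_{t=1}^{T} \sum_a \Pr[X_t = a \mid \ell_t(a) \text{ is observed}] \le \frac{\ln K}{\eta} + \frac{\eta}{2} \sum_{t=1}^{T} \alpha(G_t)$
– fixed graph: $R_T = O\big(\sqrt{T\,\alpha(G)\ln K}\big)$
– changing graphs: $R_T = O\big(\sqrt{\textstyle\sum_t \alpha(G_t)\,\ln K}\big)$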
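The update rule can be sketched in a few lines of Python. This is a minimal illustration for a fixed undirected graph, not the authors' implementation: the function name `exp3_set`, the dictionary-of-neighbors layout, and the learning-rate argument `eta` are assumptions, and the cumulative estimates are shifted by their minimum only to keep the exponentials numerically stable.

```python
import math
import random


def exp3_set(losses, neighbors, eta, rng):
    """Sketch of EXP3-SET on a fixed undirected graph; returns the actions played."""
    K = len(losses[0])
    cum_est = [0.0] * K  # cumulative importance-weighted loss estimates
    # Closed neighborhoods: an action always observes itself.
    closed = [set(neighbors[a]) | {a} for a in range(K)]
    played = []
    for loss_row in losses:
        shift = min(cum_est)
        w = [math.exp(-eta * (c - shift)) for c in cum_est]
        total = sum(w)
        p = [x / total for x in w]
        x_t = rng.choices(range(K), weights=p)[0]
        played.append(x_t)
        # Undirected graph: playing x_t reveals exactly closed[x_t].
        for a in closed[x_t]:
            q = sum(p[j] for j in closed[a])  # Pr[loss of a is observed]
            cum_est[a] += loss_row[a] / q
    return played


# Hypothetical usage: 3 actions, full-information graph, action 0 is best.
T, K = 300, 3
clique = {a: [b for b in range(K) if b != a] for a in range(K)}
demo = exp3_set([[0.0, 1.0, 1.0]] * T, clique, eta=0.2, rng=random.Random(0))
print(demo.count(0), "plays of the best action out of", T)
```

With no edges the estimate reduces to the usual bandit estimate `loss / p[a]`, and with a clique `q` is always 1, so the same code covers both ends of the spectrum.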
EXP3-Set Regret – key lemma
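• Lemma
$Q = \sum_a \frac{\Pr[X_t = a]}{\Pr[a \text{ observed}]} \le \alpha(G)$
Note:
MAB: Q=K
Full info. Q=1
• Proof: Build an i.s. S
– consider action a with
minimal Pr[a observed]
– Add a to S
– Delete a and its
neighbors
• Note (for the chosen a)
$\sum_{j \in N(a)} \frac{\Pr[X_t = j]}{\Pr[j \text{ observed}]} \le \sum_{j \in N(a)} \frac{\Pr[X_t = j]}{\Pr[a \text{ observed}]} = 1$
(here $N(a)$ includes $a$ itself, so $\Pr[a \text{ observed}] = \sum_{j \in N(a)} \Pr[X_t = j]$)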
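The proof idea can be checked numerically on a small graph. The sketch below computes Q and runs the greedy construction from the proof, re-evaluating the observation probabilities on the shrinking graph. It assumes a strictly positive distribution `p`; the helper names are hypothetical.

```python
def observe_prob(a, p, neighbors, alive=None):
    """Pr[a observed] = p[a] plus the mass of a's neighbors (optionally only those still alive)."""
    return p[a] + sum(p[j] for j in neighbors[a] if alive is None or j in alive)


def q_value(p, neighbors):
    """Q = sum_a Pr[X = a] / Pr[a observed]."""
    return sum(p[a] / observe_prob(a, p, neighbors) for a in range(len(p)))


def greedy_independent_set(p, neighbors):
    """Repeatedly take the action least likely to be observed, then delete it and its neighbors."""
    alive = set(range(len(p)))
    chosen = []
    while alive:
        a = min(sorted(alive), key=lambda v: observe_prob(v, p, neighbors, alive))
        chosen.append(a)
        alive -= {a} | set(neighbors[a])
    return chosen


# 5-cycle with the uniform distribution.
cycle = {i: [(i - 1) % 5, (i + 1) % 5] for i in range(5)}
uniform = [0.2] * 5
print(q_value(uniform, cycle), greedy_independent_set(uniform, cycle))
```

Each greedy step removes at most one unit of Q, so Q is at most the size of the independent set that is built, which in turn is at most $\alpha(G)$.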
Dominating set – directed graph
[Figure: a directed observation graph with unknown losses ("?"), illustrating a dominating set]
EXP3-DOM
• Simplified version
– fixed graph G
– D is dominating set
• log approx
• Main modification
– add probabilities to D
• induce observability
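• probabilities:
$p_{a,t} = (1-\gamma)\,\frac{w_{a,t}}{W_t} + \gamma\,\frac{\mathbb{I}[a \in D]}{|D|}$
• Select $X_t$ using $p_t$
• Observe $\ell_t(a)$ for a in $S_{X_t,t}$
• weights
$w_{a,t+1} = w_{a,t}\,\exp\big(-\gamma\,\hat\ell_t(a)/|D|\big)$
$\hat\ell_t(a) = \frac{\ell_t(a)}{\Pr[\text{observe } a]}\;\mathbb{I}[a \in S_{X_t,t}]$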
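The two ingredients of the simplified algorithm can be sketched as follows: a greedy set-cover heuristic for the dominating set (one standard way to get a logarithmic approximation, used here as an assumption about how D is found) and the mixed sampling distribution. The names `greedy_dominating_set`, `exp3_dom_probabilities` and `out_neighbors` are hypothetical; `out_neighbors[a]` lists the actions whose loss is revealed when a is played.

```python
def greedy_dominating_set(out_neighbors):
    """Greedy set cover: keep picking the action that reveals the most uncovered actions."""
    K = len(out_neighbors)
    covers = [set(out_neighbors[a]) | {a} for a in range(K)]  # every action sees itself
    uncovered = set(range(K))
    dom = []
    while uncovered:
        best = max(range(K), key=lambda a: len(covers[a] & uncovered))
        dom.append(best)
        uncovered -= covers[best]
    return dom


def exp3_dom_probabilities(weights, dom, gamma):
    """p_a = (1 - gamma) * w_a / W + gamma * I[a in D] / |D|."""
    total = sum(weights)
    dom_set = set(dom)
    bonus = gamma / len(dom)
    return [(1 - gamma) * w / total + (bonus if a in dom_set else 0.0)
            for a, w in enumerate(weights)]


# Action 0 observes everyone, so D = {0} and all exploration mass goes to it.
graph = {0: [1, 2, 3], 1: [2], 2: [3], 3: []}
D = greedy_dominating_set(graph)
print(D, exp3_dom_probabilities([1.0, 1.0, 1.0, 1.0], D, gamma=0.2))
```

Putting the exploration mass on D guarantees that every action is observed with probability at least $\gamma/|D|$, which keeps the importance-weighted estimates bounded.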
EXP3-DOM
• Simple example
• Transitive observability
– tournament
• action 1 observes all
actions
– D={1}
• EXP3-DOM
• Sample action 1 with
prob γ
– action 1 is the
exploration
• Otherwise run a MAB
– specifically EXP3-SET
• Intuition
– action 1 replaces
mixture with uniform
Conclusion
• Observability model
– Between MAB and Experts
• more work to be done
• Uninformed setting
– Undirected graph
• Informed setting
– Directed graph
• [Kocák, Neu, Valko and R. Munos]
improved the uninformed setting
Outline
• Model and motivation
• symmetric observability
• non-symmetric observability
EXP3-DOM: key lemma
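• Lemma
– G directed graph,
– $d_i^-$ indegree of i,
– $\alpha = \alpha(G)$
$\sum_{i=1}^{K} \frac{1}{1 + d_i^-} \le 2\alpha \ln\Big(1 + \frac{K}{\alpha}\Big)$
• Proof: high level
– shrink graph
• $G_K, G_{K-1}, \ldots$
– delete nodes
• step s:
– delete max indegree node
• From Turan's theorem (with $D_s$ the arc set of $G_s$ and $\alpha_s = \alpha(G_s)$)
$\max_i d^-_{i,s} \ge \frac{|D_s|}{|V_s|} \ge \frac{|V_s|}{2\alpha_s} - \frac{1}{2}$
• Turan's Theorem
– undirected graph G(V,E)
$\alpha(G) \ge \frac{|V|}{\frac{2|E|}{|V|} + 1}$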
EXP3-DOM: key lemma (proof)
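• Completing the proof (node 1 is the max indegree node deleted from $G_K$)
$\sum_{i=1}^{K} \frac{1}{1 + d^-_{i,K}} = \frac{1}{1 + d^-_{1,K}} + \sum_{i=2}^{K} \frac{1}{1 + d^-_{i,K}}$
$\le \frac{2\alpha}{\alpha + K} + \sum_{i=2}^{K} \frac{1}{1 + d^-_{i,K}}$
$\le \frac{2\alpha}{\alpha + K} + \sum_{i=1}^{K-1} \frac{1}{1 + d^-_{i,K-1}}$
$\le \cdots \le \sum_{i=1}^{K} \frac{2\alpha}{\alpha + i} \le 2\alpha \ln\Big(1 + \frac{K}{\alpha}\Big)$
• Note, due to edge elimination
$d^-_{i+1,K} \ge d^-_{i,K-1}$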
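The lemma $\sum_i 1/(1+d_i^-) \le 2\alpha\ln(1+K/\alpha)$ can be sanity-checked on small graphs. The sketch below computes both sides, with the independence number found by brute force, so it is only meant for a handful of nodes. The function names and the arc-list layout (pairs `(tail, head)`, no self-loops) are assumptions.

```python
import math
from itertools import combinations


def independence_number(K, arcs):
    """Brute-force alpha(G); arc orientation is ignored."""
    adjacent = {frozenset(arc) for arc in arcs}
    for size in range(K, 0, -1):
        for subset in combinations(range(K), size):
            if all(frozenset(pair) not in adjacent for pair in combinations(subset, 2)):
                return size
    return 0


def lemma_sides(K, arcs):
    """Return (sum_i 1/(1 + indegree_i), 2 * alpha * ln(1 + K / alpha))."""
    indegree = [0] * K
    for _tail, head in arcs:
        indegree[head] += 1
    lhs = sum(1.0 / (1 + d) for d in indegree)
    alpha = independence_number(K, arcs)
    rhs = 2 * alpha * math.log(1 + K / alpha)
    return lhs, rhs


# Transitive tournament on 5 nodes: arc i -> j whenever i < j, so alpha = 1.
tournament = [(i, j) for i in range(5) for j in range(i + 1, 5)]
print(lemma_sides(5, tournament))
```

On the transitive tournament the left-hand side is a harmonic sum, which shows why a logarithmic factor appears in the bound even when $\alpha = 1$.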
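EXP3-DOM: key lemma (modified)
• Lemma (what we really need!)
• G(V,E) directed graph
– $IN_i$ in-neighbors of i
– r size of dominating set; and α size of ind. set
– p distribution over V
• $p_i \ge \beta$
$Q = \sum_{i=1}^{K} \frac{p_i}{p_i + \sum_{j \in IN_i} p_j} \le 2\alpha \ln\Big(1 + \frac{\lceil K^2/(r\beta) \rceil + K}{\alpha}\Big) + 2r$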
EXP3-DOM: changing graphs
• Simple
– all dom. set same size
– approx. same size
• Problem
– different size dom. set
• can be 1 or K
• Solution
– keep log levels
• depend on $\log_2 |D_t|$
– algorithm per level
• Complications
– parameters depend on
level
– setting the learning rate
• need a delicate doubling
• Main tech. challenge
– handle dynamic
adversary.
EXP3-DOM
• receive obs. graph
– find dominating set Dt
• logarithmic
approximation
• Run the right copy
– Let $b_t = \lfloor \log_2 |D_t| \rfloor$
– run copy bt
• log copies
• For Copy bt
– param. depend on bt
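• probabilities:
$p_{a,t} = (1-\gamma)\,\frac{w_{a,t}}{W_t} + \gamma\,\frac{\mathbb{I}[a \in D_t]}{|D_t|}$
• Select $X_t$ using p
• Observe $\ell_t(a)$ for a in $S_{X_t,t}$
• weights
$w_{a,t+1} = w_{a,t}\,\exp\big(-\gamma\,\hat\ell_t(a)/2^{b_t}\big)$
$\hat\ell_t(a) = \frac{\ell_t(a)}{\Pr[\text{observe } a]}\;\mathbb{I}[a \in S_{X_t,t}]$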
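The bookkeeping for the log-many copies can be sketched as below, assuming the level is the integer part of $\log_2 |D_t|$ and that each copy keeps its own weight vector. The names `level`, `make_copies` and `update_copy` are hypothetical, and the per-level tuning of $\gamma$ is left out.

```python
import math


def level(dom_size):
    """b = floor(log2(dom_size)) for dom_size >= 1, computed with integer arithmetic."""
    return dom_size.bit_length() - 1


def make_copies(K):
    """One weight vector per level b = 0, ..., floor(log2 K)."""
    return {b: [1.0] * K for b in range(level(K) + 1)}


def update_copy(weights, loss_estimates, gamma, b):
    """w_a <- w_a * exp(-gamma * estimate_a / 2^b), applied to the copy that was run."""
    scale = 2 ** b
    return [w * math.exp(-gamma * est / scale)
            for w, est in zip(weights, loss_estimates)]


copies = make_copies(8)
b_t = level(3)  # a dominating set of size 3 is handled by copy 1
copies[b_t] = update_copy(copies[b_t], [0.0, 2.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
                          gamma=0.5, b=b_t)
print(sorted(copies), b_t)
```

Grouping rounds by level means each copy only ever sees dominating sets whose sizes differ by at most a factor of two, which is the "approx. same size" case.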
EXP3-DOM – main Theorem
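• Theorem: (with $T^{(b)}$ the rounds in which copy b is run)
$R_T \le \sum_{b=0}^{\lfloor \log_2 K \rfloor} \Big( \frac{2^b \ln K}{\gamma_b} + \gamma_b\, \mathbb{E}\Big[ \sum_{t \in T^{(b)}} \Big( 1 + \frac{Q_t^{(b)}}{2^{b+1}} \Big) \Big] \Big)$
• tuning $\gamma_b$
$R_T = O\Big( (\ln K)\, \mathbb{E}\Big[ \sqrt{\textstyle\sum_{t=1}^{T} \big( 4|D_t| + Q_t^{(b_t)} \big)} \Big] + (\ln K) \ln(KT) \Big)$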
Independent set
• Independent set α(G)
• [Mannor & Shamir 2012]
• Tight Regret: $\sqrt{T\,\alpha(G)\ln K}$
– α(G) "replaces" K
• Cons:
– requires observing G
– solves an LP at each step
[Figure: observation graph with unknown losses ("?")]