14. UCB1 algorithm
argmax
i
( ¯xi +
r
2logn
ni
)
• : The mean payout for machine
• : The number of plays of machine
• : The total number of plays
¯xi +
r
2logn
ni¯xi +
r
2logn
ni
i
i
n
6
7
8
6
7
8
6
7
8
6
7
8
6
7
8
6
7
8
F
G
H
F
G
H
F
G
H
6
7
8
6
7
8
6
7
8
F
G
H
F
G
H
F
G
H
F
G
H
F
G
H
F
G
H
14
15. UCB1 algorithm
R(s, a) = Q(s, a) + c
s
logN(s)
N(s, a) 3/5
1/2 2/3
0/1 1/2 1/1 1/1
1/1 0/1
a⇤
= argmax
a
R(s, a)
3/5
2/3 N(s, a) = 3
N(s) = 5
c = constant
Q(s, a) = 2/3
15
73. • Each edge stores a set of staIsIcs
• : combined mean acIon value
• : prior probability evaluated by
• : esImated acIon value by
• : esImated acIon value by
• : counts of evaluaIons by
• : counts of evaluaIons by
Searching
P(s, a)
Nv(s, a)
Nr(s, a)
Wr(s, a)
Wv(s, a)
Q(s, a)
s
a
v✓(s)
p⇡(a|s)
v✓(s)
p⇡(a|s)
p (a|s)
73
74. SelecIon
a⇤
= argmax
a
(Q(s, a) + u(s, a))
u(s, a) = cP(s, a)
pP
b Nr(s, b)
1 + Nr(s, a)
Choose acIon
a⇤
PUCT Algorithm:
ExploitaIon ExploraIon
Visit counts
of parent node s
Visit counts
of edge (s, a)
Level of exploraIon
s
74