Updates the Q-value of a state-action pair after the action has been taken in that state and an immediate reward has been received.
What does it mean to update the Q-value of a state-action pair?
The update combines the current value Q(s,a), the immediate reward r (given by a fitness function), and the learning rate α:
Q(s,a) = Q(s,a) + ( α * r )
Q(s,a) = 0 + ( α * r )
Q(s,a) = 0 + ( α * 0 )
Q(s,a) = 0 + ( 0.3 * 0 )
0.6 = 0 + ( 0.3 * 2 )
-2.4 = 0.6 + ( 0.3 * -10 )
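The worked steps above can be sketched in code. This is a minimal illustration with assumed names (`update_q` is not from the slides), reproducing the same arithmetic:

```python
# One-step Q-value update: Q(s,a) <- Q(s,a) + (alpha * r)
def update_q(q, alpha, reward):
    """Return the new Q-value after receiving an immediate reward."""
    return q + alpha * reward

alpha = 0.3
q = 0.0
q = update_q(q, alpha, 2)    # 0 + (0.3 * 2) = 0.6
q = update_q(q, alpha, -10)  # 0.6 + (0.3 * -10) = -2.4
```

A large negative reward can flip an accumulated positive value in a single step, as the last line shows.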
by controlling α (the learning rate)
Should it be high? Low?
Varying?
The exploration policy selects a random action with probability ε, and the best-known action with probability 1 - ε.
Instead of keeping a constant exploration rate, we applied meta-reasoning to the ε-value too, keeping the exploration rate equal to half of the learning rate.
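The ε-greedy rule above, with ε tied to half of α, can be sketched as follows (function and parameter names are assumptions for illustration):

```python
import random

def epsilon_greedy(q_values, alpha, rng=random):
    """Pick a random action with probability eps = alpha / 2,
    otherwise the action with the highest known Q-value."""
    eps = alpha / 2  # exploration rate kept at half the learning rate
    actions = list(q_values)
    if rng.random() < eps:
        return rng.choice(actions)          # explore: random action
    return max(actions, key=q_values.get)   # exploit: best-known action
```

Tying ε to α means that as learning slows down, exploration shrinks with it.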
BWAPI -
C++ framework for injecting code into StarCraft
BTHAI -
Bot implemented on top of BWAPI, with some pre-made high-level strategies
High-Level Strategy:
Learning and selection of high-level strategies, not atomic tasks;
At the end of each match:
Victory Feedback: +1
Defeat Feedback: -1
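The end-of-match feedback can be sketched as a single Q-update on the chosen high-level strategy. This is a minimal sketch under assumed names (`end_of_match_update`, the dict-based table, and α = 0.3 are illustrative, not from the slides):

```python
# Update the selected strategy's Q-value with +1 on victory, -1 on defeat.
def end_of_match_update(q_table, strategy, won, alpha=0.3):
    reward = 1 if won else -1
    q_table[strategy] = q_table.get(strategy, 0.0) + alpha * reward
    return q_table

q = {}
end_of_match_update(q, "rush", won=True)   # "rush" gains 0.3
end_of_match_update(q, "rush", won=False)  # "rush" drops back to 0.0
```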
Terran vs. built-in CPU:
The learning agent always plays as Terran.