goal and –100 when she departs from the area in any other way.
Fig. 4. Model of the discretized world
There are of course many more useful signals, e.g. the distance to the goal (d), a penalty for frequent course changes, or a negative reward for receding from the goal. To simplify calculations we assume that the speed of the ship is constant. The risk of grounding can be treated as a multi-criteria problem which quantifies the danger of getting stranded in shallow water. It can be estimated as a function of the ship's position, course and angular velocity.
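As an illustration, a minimal sketch of such a composite reward signal is given below; the goal reward value, the weights and the state fields are assumptions made for this example, not the exact formulation used in the experiments.

# Illustrative sketch of a composite reward signal for the ship-handling
# agent; the weights, the goal reward value and the state fields are
# assumptions for this example, not the paper's exact formulation.
from dataclasses import dataclass

GOAL_REWARD = 100.0      # assumed terminal reward for reaching the goal
LEAVE_PENALTY = -100.0   # penalty for leaving the area in any other way (from the text)
W_DIST, W_TURN, W_RISK = 0.01, 0.1, 1.0   # illustrative weights

@dataclass
class ShipState:
    distance_to_goal: float   # d, e.g. in grid cells
    course_change: float      # course change since the last step, in degrees
    grounding_risk: float     # danger of stranding, assumed scaled to [0, 1]

def reward(s: ShipState, reached_goal: bool, left_area: bool) -> float:
    if reached_goal:
        return GOAL_REWARD
    if left_area:
        return LEAVE_PENALTY
    # shaping terms: distance to goal, frequent course changes, grounding risk
    return -(W_DIST * s.distance_to_goal
             + W_TURN * abs(s.course_change)
             + W_RISK * s.grounding_risk)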
Adding more signals to the state vector and the reward function can improve how closely the estimated state value function reflects the real coastal situation, but it can also greatly increase the computational complexity. If one assumes that the state vector is described by a 100 x 100 matrix of available positions, 360 courses, 41 angular velocities and 71 rudder angles, this gives well over a million state-action pair values of real type, and the storage doubles with eligibility traces. One can deal with this problem by discretizing the huge state space and estimating state-action pair values with common approximation methods.
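A quick back-of-the-envelope check of the counts quoted above (assuming the rudder angle plays the role of the action) illustrates the scale:

# Rough size of the tabular representation for the counts quoted above.
positions = 100 * 100          # 100 x 100 matrix of available positions
courses = 360
angular_velocities = 41
rudder_angles = 71             # treated here as the available actions

state_action_pairs = positions * courses * angular_velocities * rudder_angles
print(f"{state_action_pairs:,} state-action values")
print(f"{2 * state_action_pairs:,} values when each pair also stores an eligibility trace")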
In the case of the navigation task, discretization of the ship's position, course and rudder angle can significantly improve the learning rate while keeping an acceptable fidelity of the overall model to the real situation.
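A minimal sketch of such a discretization is given below; the grid cell size, the bin widths and the assumed rudder range of ±35 degrees are illustrative assumptions only.

# Sketch of a coarse discretization of the continuous ship state; the bin
# widths are illustrative assumptions, not the values used in the experiments.
def discretize(x, y, course_deg, rudder_deg,
               cell=10.0, course_bin=10.0, rudder_bin=5.0):
    """Map a continuous (position, course, rudder angle) tuple to a
    discrete index tuple used as a key into the Q-table."""
    return (int(x // cell),
            int(y // cell),
            int(course_deg % 360 // course_bin),
            int((rudder_deg + 35.0) // rudder_bin))   # assume rudder in [-35, +35] degrees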
An example of a discretized state space is shown in figure 4. It is part of the application interface created and tested by the author. Experimental results showed that in simpler layouts of possible routes with few obstacles, the SARSA reinforcement learning algorithm was able to find a proper, although not optimal, helmsman behavior after about 800-2000 episodes.
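The tabular SARSA update follows the standard form (Sutton & Barto 1998); the sketch below shows a minimal episode loop, with the environment interface and the parameter values assumed for illustration.

# Minimal tabular SARSA episode loop (Sutton & Barto 1998); the environment
# interface (env.reset, env.step, env.actions) and the parameter values
# are assumptions made for this sketch.
import random
from collections import defaultdict

def sarsa(env, episodes=2000, alpha=0.1, gamma=0.95, epsilon=0.1):
    Q = defaultdict(float)                      # Q[(state, action)] -> value

    def policy(s):                              # epsilon-greedy helmsman policy
        if random.random() < epsilon:
            return random.choice(env.actions)   # explore: random rudder command
        return max(env.actions, key=lambda a: Q[(s, a)])

    for _ in range(episodes):
        s = env.reset()
        a = policy(s)
        done = False
        while not done:
            s2, r, done = env.step(a)           # advance the ship one time step
            a2 = policy(s2)
            # on-policy update towards the action actually taken next
            Q[(s, a)] += alpha * (r + gamma * Q[(s2, a2)] * (not done) - Q[(s, a)])
            s, a = s2, a2
    return Q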
There were other approaches involving a weaker discretization of the state space and maps with detailed obstacles (i.e. shallow waters), as in figure 2. Additionally, to improve value backups in the episodic learning process, eligibility traces were used.
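For reference, a trace-based backup can be sketched as follows; the use of replacing traces and the parameter values are assumptions made for this example, and the Q and E tables are assumed to be defaultdicts as in the previous sketch.

# Sketch of a SARSA(lambda) backup with eligibility traces; replacing traces
# and the lambda value are assumptions for illustration.
def sarsa_lambda_step(Q, E, s, a, r, s2, a2, done,
                      alpha=0.1, gamma=0.95, lam=0.9):
    delta = r + gamma * Q[(s2, a2)] * (not done) - Q[(s, a)]
    E[(s, a)] = 1.0                          # replacing trace for the visited pair
    for key in list(E):
        Q[key] += alpha * delta * E[key]     # every eligible pair shares the backup
        E[key] *= gamma * lam                # traces decay between steps
        if E[key] < 1e-4:
            del E[key]                       # drop negligible traces to save memory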
5 CONCLUSIONS
Experimental results with 1-step Q-Learning confirm its slow learning rate, which is very inconvenient in large state space problems. Eligibility traces, which bring learning closer to Monte Carlo methods, improved the learning speed. It was also very important to dynamically change the learning parameters during the learning process.
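One simple way to realise such a dynamic adjustment, sketched here under the assumption that the step size and the exploration rate are the parameters being decayed, is to anneal them with the episode number; the schedule and constants below are illustrative only.

# Illustrative annealing of learning parameters over episodes; the schedule
# and the constants are assumptions, not the exact scheme used by the author.
def schedule(episode, alpha0=0.5, epsilon0=0.3, decay=500.0):
    alpha = alpha0 * decay / (decay + episode)      # step size shrinks over time
    epsilon = epsilon0 * decay / (decay + episode)  # exploration fades out
    return alpha, epsilon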
The SARSA algorithm follows a longer but safer route during the learning process, in accordance with its value function update.
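This behaviour follows from the standard update targets (Sutton & Barto 1998): SARSA backs up towards the action actually taken, Q(s,a) ← Q(s,a) + α[r + γ Q(s′,a′) − Q(s,a)], whereas Q-Learning backs up towards the greedy action, Q(s,a) ← Q(s,a) + α[r + γ max_a′ Q(s′,a′) − Q(s,a)]. With the on-policy target, exploratory mistakes near obstacles lower the value of risky routes, so the learned helmsman keeps a larger safety margin.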
Using parameterised function approximation for generalization (Sutton, R. 1996) or artificial neural networks is the next step that can improve the reinforcement learning process in ship handling.
Some other advanced algorithms like prioritized
sweeping can be taken into consideration in future
work.
Furthermore, splitting the single agent into a multi-agent environment could bring new solutions to this problem.
REFERENCES
Eden, T., Knittel, A. & Uffelen, R. 2002. Reinforcement Learning: Tutorial
Kaelbling, L.P., Littman, M.L. & Moore, A.W. 1996. Reinforcement Learning: A Survey. Journal of Artificial Intelligence Research 4: 237-285
The Reinforcement Learning Repository, University of
Massachusetts, Amherst
Sutton, R. 1996. Generalization in Reinforcement Learning:
Successful Examples Using Sparse Coarse Coding. In
Touretzky, D., Mozer, M., & Hasselmo, M. (Eds.), Neural
Information Processing Systems 8.
Sutton, R. & Barto, A. 1998. Reinforcement Learning: An Introduction. Cambridge, MA: MIT Press
Tesauro, G. 1995. Temporal Difference Learning and TD-
Gammon, Communications of the Association for
Computing Machinery, vol. 38, No. 3.