ARE WE LEARNING NOW?

by

ZARUL HAMZAH


CONTENTS

Introduction
Relations to Game Theory
Reinforcement Learning and the IPD
More on Reinforcement Learning and the IPD
The Basis of Q-Learning
Conclusion
References

INTRODUCTION

Research in Distributed Artificial Intelligence (DAI) is concerned with how automated agents can be designed to interact effectively. One important capability that could aid inter-agent cooperation is based upon the idea of communication and compromise. Since the importance of multi-agent technology is likely to increase, optimising these interactions can yield greater cooperation as well as better-quality schedules. In this article, we discuss the most promising methods of achieving this goal, along with the problems involved in implementing them. The strategy involves the concept of learning in self-interested agents, and it incorporates ideas from both computing and animal behaviour. The general idea is that a "learning" agent learns about the environment it is in, as well as about the other agents' strategies, adapts to them, and uses the best available methods to respond and maximise its own utility. The basis of "self-interest" is therefore still evident in this context; only the method an agent uses to achieve its goals has changed. The technique uses the ideas of Game Theory [Rosenschein and Genesereth, 1985] to model negotiation.





Relations to Game Theory

The main problem when analysing multi-agent negotiation is the "language" used in this context. A common vocabulary among designers and agents alike needs to be established before any significant advance can be made. General concepts of "negotiation" have been discussed at great length, whereas what is meant by the term itself has not. By importing the techniques of Game Theory into these analyses, we introduce tools for designing negotiation in automated agents. However, while Game Theory generally deals with negotiation in continuous domains and among agents with complete information, discrete domains and agents with partial or no information are also considered (since these are evident in the real world). Standard Game Theory definitions are therefore used for the most part, that is, to analyse problems of specific interest to artificial intelligence.





Reinforcement Learning and the IPD

The technological advancement in this field builds upon the idea of Reinforcement Learning (RL) among agents. Reinforcement Learning is the process by which an agent improves its behaviour in an environment through experience. Most RL research has been confined to single-agent settings, or to multi-agent settings where agents have either positively correlated payoffs (team problems) or totally negatively correlated payoffs (zero-sum games). This idea is then applied to studies of the Iterated Prisoner's Dilemma (IPD), where RL is considered more difficult. Reinforcement Learning is applicable in cases where the learning system is not provided with a target output for each input, but instead must select an output for which it receives only a scalar evaluation. This necessarily requires exploration, that is, finding the best output for any given input. Naturally, if an action produces a favourable result it is strengthened, and if it produces an unfavourable result it is weakened. This framework is appealing from a biological point of view, since an animal has certain built-in preferences (such as pleasure or pain) but generally has no external signal telling it what the best action is for any given situation.
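
As an illustration of learning from a scalar evaluation only, the following Python sketch shows an agent that keeps a value estimate for each action and strengthens or weakens the chosen action according to the payoff it receives. The action names, the payoff function and all parameter values are illustrative assumptions, not taken from the work discussed here.

    import random

    # A minimal sketch of learning from a scalar evaluation: the agent keeps a
    # value estimate per action and nudges the chosen action's estimate toward
    # the payoff it received. The payoff function below is hypothetical.

    ACTIONS = ["cooperate", "defect"]
    values = {a: 0.0 for a in ACTIONS}   # initial value estimates
    learning_rate = 0.1
    epsilon = 0.2                        # probability of exploring

    def payoff(action):
        # Hypothetical noisy environment; defecting pays slightly more on average.
        return random.gauss(1.0 if action == "defect" else 0.8, 0.5)

    def select_action():
        # Explore occasionally, otherwise exploit the current best estimate.
        if random.random() < epsilon:
            return random.choice(ACTIONS)
        return max(ACTIONS, key=lambda a: values[a])

    for _ in range(1000):
        a = select_action()
        r = payoff(a)
        # A favourable result strengthens the estimate, an unfavourable one weakens it.
        values[a] += learning_rate * (r - values[a])

    print(values)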



More on Reinforcement Learning and the IPD

Carrying on from this idea, RL tasks can be divided into two types.

In non-sequential tasks, an agent must learn a mapping from situations to actions that maximises the expected immediate payoff; to put it another way, agents effectively have an actions-to-reward manual by their side. Sequential tasks are more difficult because the actions an agent selects may influence its future situations and thus its future payoffs; for example, if you get conned by an "insurance" salesman, you are likely to be more cautious in the future. In this case the agent interacts with the environment over an extended period of time and needs to evaluate its actions on the basis of their long-term consequences. This involves a credit assignment problem: a whole sequence of actions takes place before the long-term consequences are known, and actions in the sequence may have different values with respect to those consequences. From the perspective of control theory, RL algorithms are techniques for addressing stochastic optimal control problems, where the agent is the controller and the environment is the system to be controlled. The objective of this research is to maximise performance measures over time.

Returning to the IPD, we now apply RL within this framework. The IPD is a 2-agent Prisoner's Dilemma game describing social situations in which a prisoner faces two alternative options: cooperating, that is, doing the socially responsible thing, or defecting, that is, acting according to self-interest regardless of how harmful this might be to the other agent. Characteristically, each agent is better off defecting regardless of the opponent's choice, but the sum of the agents' payoffs is maximised if both agents cooperate; thus the dilemma. In an iterated game, it may be beneficial even for a selfish agent to cooperate on some iterations in the hope of soliciting cooperation from its opponent. In general, describing an intelligent strategy for the IPD is difficult because arbitrarily long input histories, that is, the experiences the agent has had, must be considered.
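
To make the payoff structure concrete, here is a small Python sketch of the two-agent game and its iterated version. The numeric payoffs (3, 5, 1 and 0) are the conventional textbook values and are an assumption; the two strategies shown are only examples.

    # Sketch of the 2-agent Prisoner's Dilemma payoff structure described above.
    # The payoff values are the usual textbook numbers, assumed for illustration.

    C, D = "cooperate", "defect"

    # PAYOFFS[(my_move, opponent_move)] -> (my_payoff, opponent_payoff)
    PAYOFFS = {
        (C, C): (3, 3),   # mutual cooperation
        (C, D): (0, 5),   # I am exploited
        (D, C): (5, 0),   # I exploit the opponent
        (D, D): (1, 1),   # mutual defection
    }

    def iterated_game(strategy_a, strategy_b, rounds=10):
        """Play the iterated game, feeding each strategy the full history so far."""
        history_a, history_b = [], []
        total_a = total_b = 0
        for _ in range(rounds):
            move_a = strategy_a(history_a, history_b)
            move_b = strategy_b(history_b, history_a)
            pay_a, pay_b = PAYOFFS[(move_a, move_b)]
            total_a += pay_a
            total_b += pay_b
            history_a.append(move_a)
            history_b.append(move_b)
        return total_a, total_b

    # Example strategies: always defect versus tit-for-tat.
    def always_defect(mine, theirs):
        return D

    def tit_for_tat(mine, theirs):
        # Cooperate first, then copy the opponent's previous move.
        return theirs[-1] if theirs else C

    print(iterated_game(always_defect, tit_for_tat))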





The Basis of Q-Learning

Q-learning [Watkins, 1989] is a recent form of Reinforcement Learning algorithm that does not need a model of its environment and can be used on-line. It is therefore well suited to repeated games against an unknown opponent (recall the IPD). Q-learning works by estimating the values of state-action pairs. The value Q(s,a) is defined to be the expected discounted sum of future payoffs obtained by taking action a from state s and following an optimal policy thereafter. Once these values have been learned, the optimal action from any state is the one with the highest Q-value. After being initialised to arbitrary numbers, the Q-values are estimated on the basis of experience as follows (a short code sketch follows the steps):

  1. From the current state s, select an action a. This yields an immediate payoff r and a transition to the next state s'.
  2. Update Q(s,a) on the basis of this experience:

     ΔQ(s,a) = α [ r + γ max_b Q(s',b) - Q(s,a) ]

     where α is the learning rate and 0 < γ < 1 is the discount factor.
  3. Go to step 1.
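
The loop below is a minimal Python sketch of these three steps applied to the IPD. The state encoding (the pair of moves played on the previous iteration), the fixed tit-for-tat opponent, the epsilon-greedy action selection and all parameter values are assumptions made for illustration; they are not the specific setup used in the work cited here.

    import random
    from collections import defaultdict

    ACTIONS = ["C", "D"]
    PAYOFF = {("C", "C"): 3, ("C", "D"): 0, ("D", "C"): 5, ("D", "D"): 1}

    alpha, gamma = 0.1, 0.9           # learning rate and discount factor
    Q = defaultdict(float)            # Q[(state, action)], implicitly initialised to 0

    def opponent(state):
        # Hypothetical fixed opponent: tit-for-tat against our previous move.
        return state[0] if state else "C"

    state = ()                        # empty history before the first iteration
    for _ in range(5000):
        # 1. From the current state s, select an action a (epsilon-greedy here).
        if random.random() < 0.1:
            a = random.choice(ACTIONS)
        else:
            a = max(ACTIONS, key=lambda b: Q[(state, b)])
        their_move = opponent(state)
        r = PAYOFF[(a, their_move)]   # immediate payoff
        next_state = (a, their_move)  # arrival at the next state s'
        # 2. Update Q(s,a) by alpha * (r + gamma * max_b Q(s',b) - Q(s,a)).
        best_next = max(Q[(next_state, b)] for b in ACTIONS)
        Q[(state, a)] += alpha * (r + gamma * best_next - Q[(state, a)])
        # 3. Go to step 1.
        state = next_state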

This algorithm is guaranteed to converge to the correct Q-values with probability one provided that the environment is stationary and Markovian (the next state depends only on the current state and the action taken in it), a lookup table is used to store the Q-values, every state-action pair continues to be visited, and the learning rate is decreased appropriately over time. The algorithm itself does not specify which action to select at each step. In practice an exploration method, usually the Boltzmann distribution strategy, is chosen because it ensures sufficient exploration while still favouring actions with higher value estimates. Experiments with Q-learning agents have been carried out in the past with favourable results. Bradtke (1993) described encouraging results in applying multi-agent reinforcement learning to efficiently damp out disturbances of a flexible beam. Littman and Boyan (1993) describe a distributed RL algorithm for packet routing. Littman (1994) also describes experiments with Q-learning agents that try to learn a mixed strategy that is optimal against the worst possible opponent in a zero-sum 2-player game. We can see, then, that sufficient work has been done, and even more is under way, to achieve our goal of "learning" artificial intelligence.
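
For completeness, here is a sketch of Boltzmann (softmax) action selection over Q-values. The temperature values in the usage lines are arbitrary assumptions; in practice the temperature is annealed over time so that exploration decreases as the value estimates improve.

    import math
    import random

    def boltzmann_select(q_values, temperature):
        """Pick an action with probability proportional to exp(Q/temperature).

        q_values: dict mapping action -> current Q estimate.
        """
        actions = list(q_values)
        # Subtract the maximum before exponentiating for numerical stability.
        m = max(q_values.values())
        weights = [math.exp((q_values[a] - m) / temperature) for a in actions]
        total = sum(weights)
        probs = [w / total for w in weights]
        return random.choices(actions, weights=probs, k=1)[0]

    # Usage: a high temperature gives near-uniform exploration,
    # a low temperature is almost greedy with respect to the Q-values.
    q = {"C": 1.2, "D": 2.0}
    print(boltzmann_select(q, temperature=5.0))   # frequently explores
    print(boltzmann_select(q, temperature=0.1))   # almost always picks "D"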



CONCLUSION

Computational learning capabilities are an important component of intelligent agents, particularly when their designers cannot anticipate all situations that the agents might encounter and thus cannot preprogram the agents to operate as desired. Reinforcement Learning (RL) addresses this by treating any particular situation as unknown and simply acting upon experience. While selfish behaviour is necessary in zero-sum games of pure competition, and in team problems where the goals of each agent are identical to the goals of the team, it is often self-defeating in situations between those two extremes. This makes learning all the more difficult, and it becomes worse when we have multiple agents learning simultaneously. Future work should examine more closely the effect of exploration strategies on the types of patterns that develop; for example, what would happen between agents using different annealing strategies, or schedules other than Boltzmann? In addition, it would be interesting to train agents not just against single opponents but against a variety of opponents, as in a tournament situation. This might enable agents to learn more robust strategies. Nevertheless, computer scientists are taking steps in the right direction for future civilisation to enjoy. So, enjoy!



References

Tuomas W. Sandholm and Robert H. Crites. On Multiagent Q-Learning in a Semi-competitive Domain, 1994. University of Massachusetts at Amherst.

Jeffrey S. Rosenschein and Gilad Zlotkin. Negotiation and Task Sharing Among Autonomous Agents in Cooperative Domains, 1989. Hebrew University, Jerusalem.

Jeffrey S. Rosenschein and Gilad Zlotkin. Rules of Encounter. MIT Press, Cambridge, Massachusetts, 1994.

Watkins, C., 1989. Learning from Delayed Rewards. Thesis, University of Cambridge, England.

Littman, M. and Boyan, J., 1993. A Distributed Reinforcement Learning Scheme for Network Routing. TR CS-93-165, CMU.

