2.0 Scenarios
3.0 The Negotiation Problem in Different Domains
4.0 A Related Work - Economic Approaches
5.0 The Cooperative Distributed Problem Solving System
6.0 Reinforcement Learning In Autonomous Agent Systems
7.0 Q-Learning:How it works
8.0 Experiments Involving Reinforcement Learning
9.0 Conclusion
10.0 Appendices-References
The importance of automated negotiation is likely to increase as distributed systems of computers play an increasingly important role in society.Since informations and tasks are shared in distributed environment and agents have limited functionality, it will be necessary to consider ways in which these agents can be made to interact for resources and/or information effectively.
Negotiation denotes the process of several agents searching for an agreement. Agreement can be about price, about military agreements, about meeting place, about joint action, or about a joint objective.The search process may involve the exchange of information, the relaxation of initial goals, mutual concessions, lies, or threats. The way we use it, the term negotiation is closely related to the idea of reaching consensus .Separate agents, with potentially disparate interests, attempt to make a group of choice well-defined alternatives.
Research in distributed artificial intellegence is concerned with how automated agents can be designed to interact effectively.One important capability that could aid inter-agent cooperation is negotiation; agents could be built that are able to communicate their respective desires and compromise to reach mutually benificial agreements.
One of the presumed difficulties in using negotiation as a way of reaching mutual benefit is that negotiation is a costly and time-consuming process.In the present of time constraints, planning and negotiation time should be taken into consideration.The negotiation may be either about job sharing or resource allocation. In both cases we want to prevent the agents from spending too much time on negotiation and therefore not keeping to their timetables for satisfying their goals.
Research in DAI is divided into two basic classes:Distributed problem solving (DPS) and Multi agent Systems (MA).Research in DPS considers how the work involved can be divided among a number of modules or "nodes".The modules in DPS system are centrally designed to improve performance, stability, modularity, and/or reliability.They include the development of cooperation mechanisms designed to find a solution to a given problem.
Research in MA concerned with coordinating intelligent behaviour among a collection of autonomous (possibly heterogeneous) intelligent (possibly pre_existing)agents. In MA, there is no global control, no globally consistent knowledge, and no globally shared goals or success criteria.There is a real competition among agents.
Here we will describe MA as it deals with interactions among self_motivated, rational and autonomous agents.
Game theory is the right tool in the right place for the design of automated interactions. Game theory tools have been primarily applied to human behaviour, but in many ways they fall short: humans do not always appear to be rational beings, nor do they necessarily have consistent preferences over alternatives.Automated societies, on the other hand, are particularly amenable to formal analysis and design.Automated agents can exhibit predictability, consistency, narrowness of purpose (e.g., no emotions, no humour, no fears, clearly defined and consistent risk attitude), and an explicit measurement of utility ( where this can have an operative meaning inside the program controlling the agent).
Even the notion of "strategy" (a specification of what to do in every alternative during an interaction), a classic game theory term, takes on clear and unambiguous meaning when it becomes simply a program put into a computer.The notion that a human would choose a fixed strategy before an interaction, and follow it without alteration, can lead to unintutive results.Moreover, it seems more a formal construct than a realistic requirement- do humans consider every alternative ahead of time and decide what to do? On the other hand, programming a computer with a fixed strategy before an interaction is a simple description of the current reality.
Of course, neither humans nor computer programs are ideal game theory agents.Most importantly, they are not capable of unlimited reasoning power, as game theory often assumes. Nevertheless, it seems that in certain ways automated agents are closer to the game theory idealisation of an agent than humans are.
Game theory is concerned with the mathematics of interactions, and particularly with the mathematical tools that can be used to analize interactions.We use these tools great fully, but we are not proposing novel game theory tools in this report. We are using what exists.
What are we doing instead, is analysing specific domains using those game theory tools.We take, for example, a class of resource sharing problems, and analize in terms of equilibrium points, stable strategies, and pareto optimal solutions.The novelty is in using these existing tools to solve practical problems that arise when agents interact.We are designing protocols and strategies that are appropriate for specific domains, using game theory to solve engineering problems.
Automated negotiation between agents can occur in many different contexts.Consider, for example, two agents, each controlling a telecommunications network with associated resources.The load that each agent has to handle various over time, and in general it might be beneficial for each if they could share their resources.Techniques for coordinating this interaction could involve prices for renting out resources, prices that might vary as the message traffic varied on each network.
Another scenario might involve Personal Digital Assistants (PDAs), pocket computers that act as organisers and communication devices.I might use an agent embedded in my PDA for setting up meetings with others (via their PDAs), for negotiating with the auto mechanic about when I will bring my car in for repair, for notifying the taxi company when I will need a taxi to bring me home from the garage, and for ordering groceries and pizza for dinner.The true role of an automated secretary is to handle my interaction with the outside world, and coordinate my time and other resources with others.Will PDAs be able to do such sophisticated coordination? It is entirely Dependant on how sophisticated we are able to make their software.Communication among these devices is already common, but design of agent software remains primitive. The technology of miniaturisation and communication leaps ahead, while our software for coordination lags behind.
Imagine a situation in which priority for airplane landing is given to planes with less fuel on board.The system will work as follows: as a plane approaches an airport, an on-board computer radios to a computer in the control tower the amount of fuel remaining.This centralised computer, in tern, collects information from all incoming flights and produces a priority ordering, telling each plane when it will land.
Fuel is expensive, and customers are impatient.Airlines have a great incentive to get their airplanes onto the ground faster.Consider the situation, then, when airline X decides to program its computers to consistently under-report remaining fuel by, let's say, 3%.The discrepancy is small enough that it probably won't noticed by the flight crew, and any variation from expected remaining fuel that small can easily be blamed on flight conditions, the wind, and the variable weight of passengers and luggage.By this consistent under-reporting, however, the airline may find itself saving a great deal of money, and increasing its customer satisfaction.
Each airline, of course, has a similar motivation, and if all airlines under-report, the effect is neutralised.This only accentuates the problem: the airline that is honest is penalised for its honesty.There is another, hidden cost to this scenario. The designer of on-board computer is faced with the possibility of having to out-guess the other machines with which is competing.It is not an optimal strategy to simply tell the truth; instead, the airline might invest great energy in figuring out just how much it can shave off the fuel estimates under various circumstances, without being discovered ( if the weather conditions were difficult, I can under-report by an extra percentage point, but if I had a tail-wind this under-reporting will be noticed...).Ultimately, this need to second-guess and out-guess is wasteful of system resources.
One (potentially costly) solution is to institute a monitoring procedure, checking the fuel after landing.Airlines that are found under-reporting are given a sufficiently large fine to to discourage their behaviour.This is basically the system in use today, on things like military contracts; cost reporting is audited, and transgressors fined.In our scenario, however, there may still be plenty of room for manipulation, depending on just how accurate the fuel check can be.Extremely accurate fuel checks might be extremely expensive, and it just adjusts the amount of deception that is practical. Another possibility is to attempt to audit the software itself, but this is a monumental , and basically hopeless, task, given the ability of a sophisticated programmer to hide elements of the code.
Another approach might be to introduce or remove the incentive that the airlines have for programming their computers to lie.A tax could be levied on planes that that receive higher priority in the landing queue, with that money being paid out as subsidy to airlines with planes further back in the queue.This levels of utility that planes derive from having less fuel; through a suitable tax, they might be made indifferent between staying in the air ( and getting subsidy) or landing (and paying tax).The rule could still give earlier landing rights to low-fuel planes, but the airlines themselves feel no need to push ahead in line.This is a simple example of setting up an environment for automated design-making that encourages the computers' designers to build their creations in particular ways.Each designer individually may feel that this is, indeed, an equitable system that saves them the need (and ultimately the cost) of out-guessing what other designers will do.
Telephone subscribers in the United States have the option of choosing among various phone companies.They choose a long-distance carrier, who is then their default company for long distance calls, and by dialling special telephone numbers they can use any company's services at any time, overriding the default.
Consider an automated negotiation environment for phone calls.When you pickup your telephone receiver and dial your call, each of the long distance carriers responds with a a price quote for the call at the moment.The competition is dynamic: my phone call goes through the optimal carrier right now and an automated competition has been carried out among the computers representing the various telephone companies.
Once phone companies are bidding with each other, and the computer in my phone is handling the bidding procedure, there is still a question as to how the bidding should be carried out.The simplest method is the standard closed bid procedure, where each company makes a single bid, the winning bid is the lowest, and I pay the phone company the amount of bid.But even in this extremely straightforward process, phone company computers can use strategies that lead to inefficient and suboptimal outcomes.While it is clear that no company has an incentive to give a price lower than its true cost (it might win the bid and receive a larger fee).This is the standard reasoning that bidders go through: generate the highest bid that will still win the contract.It involves taking into consideration complicated factors, to try and out-guess the other bidders. The phone company computers, to be good competitive bidders in this scheme, will consider what other agents can afford, how low their bids will be, and do complicated analyses to win more phone calls from the customers.
This investment by the phone companies in complicated bidding agents, however, is basically waste of energy.It is not in consumers' interest to have phone fees spent in the design of bidding agents that can out smart one another.We would much prefer a protocol that simply motivates the phone companies to tell the truth: give my phone the true lowest price for which you are willing to carry the call.Can simple protocol like this be established?
The answer is qualified yes.There exists a bidding procedure known as Vickrey's Mechanism, where agents are motivated to submit bids that reflect their true worth calculations.Each agents submits a sealed bid, the lowest bid wins, but the price paid by the consumer is the second lowest bid.For example, if one phone company offers to carry my call for 20 pence a minute, and another offers to carry it for 22 pence a minute, the 20 pence bidder wins, but I pay him 22 pence per minute for the call.
Why might it be in the consumer's best interest to pay more money for his call? the reason is that bidders using Vickrey's Mechanism are motivated to tell the truth, giving their true lowest bid, without any need to try and out-guess opponents.The intuition is clear: a bidder will not bid too low, because if he wins the bid with this lower number, and would not have won it with the true bid, he ends up having to provide the service for less than it costs him.On the other hand, there is also no incentive to bid higher than the true value: bidding high might cause the bidder to lose the contract, but will never alter the the amount that he ends up getting (after all, that amount is using someone else's bid).So each bidder has an incentive to tell the truth.Over many phone calls, the consumer benefits from accurate bids, though he pays a bit extra to ensure it.If the bid tends to cluster, the difference being paid will not be great.By separating the issues of who wins the bid, and how much the winner gets, we have fundamentally altered the way in which computers should play the game.
Add to this the major advantage to the phone companies: no need to build complicated programs that try and calculate the optimal bid, based on the other companies' situations.The bid is relatively straightforward calculation of how much they need to receive to make the call worthwhile them to carry.
In multi agent competition situation there is a need to define a mechanism ( a protocol) that allows agents to resolve their conflicts and to reach a cooperative agreement.For a given class of interactions, some protocols might be suitable while others are not.We find it useful to categorise domains into a three-tier hierarchy of Task Oriented Domains, State Oriented Domains, and Worth Oriented Domains.Task Oriented Domains are a subset of State Oriented Domains, which are in turn a subset of Worth Oriented Domains.This hierarchy is by no means complete, but does cover a large proportion of the kinds of real-world interactions in which we are interested.
3.1 Task Oriented Domains (TOD)
Domains in which an agents' activity can be defined in terms of a set of tasks that it has to achieve. These tasks can be carried out without concern about interference from other agents; all the resources necessary to accomplish the tasks are available to the agent.On the other hand,it is possible that agents can reach agreements where they redistribute some tasks,to everyone's benefit.The domains are inherently cooperative. Negotiation is aimed at discovering mutually beneficial task redistribution.The key issues here are the notion of task,an indivisible job that needs to be carried out, and the fact that agents have unimpeded access to all necessary resources.
3.2 State Oriented Domains (SOD)
Domains where each agent is concerned with moving the world from an initial state into one of a set of goal states.These are the domains with which most artificial intelligence research has dealt.The Blocks efficiencyWorld,for example,is a classic State Oriented Domain.SODs are a superset of TODs.There is, of course, the possibility of real conflict here.Because of ,for example, competition over resources agents might have fundamentally different goals.There may be no goal states that satisfy all agents.At other times, there may exist goal states that satisfy all agents, but that are expensive to reach-and which require the agents to do more work than they would have had to do in isolation.In SOD, there are limited resources, and agents have to resolve conflicts over those resources.
3.3 Worth Oriented Domains (WOD)
Domains where agents assign a worth to each potential state, which captures its desirability for the agent.These are a generalisation of State Oriented Domains.The use of cardinal worth functions establishes a decision theoretic flavour to interactions on a WOD.The key advantage of a WOD is that the worth function allows agents to compromise on their goals,sometimes increasing the overall of the agreement. ed
What are the conditions that a Negotiation Protocol should satisfy(for any specific distributed multi-agent domain), such that it should be accepted by all the designers of agents (for that specific domain)?
Distributed -The decision making process should be distributed.There should be no central unit or agent that is managing the process.
Instantaneously -Conflict should be resolved without delay.
Efficiency -The outcome of the negotiations should be efficient
-Conflict should be avoided when possible and the mechanism should allow the agents to reach Pareto optimal agreements with high probability.An agreement is Parito optimal if there is no other agreement that dominates it, i.e., there is no other deal that is better for some of the agents and not worse for the others.
-In the resource allocation problem, the resource is not in use only when there is no agent in the group that currently needs the resource (there are no deadlocks).
Simplicity -The negotiation process itself should be simple and efficient.It should be short and consume only a reasonable amount of communication and computation resources.
Symmetric -The coordination mechanism should not treat agents differently because of non-relevant attributes.In the situations that we consider, the agent's utility functions and their role in the encounter are the relevant attributes.All other attributes, like an agent's colour, name or manufacture are not relevant.That is, Symmetric implies that given a specific situation, the replacement of an agent with another which is identical with respect to the above attributes, will not change the outcome of the negotiation.
Satisfiability Or Accessibility -In the resource allocation case we would like an agent that needs the resource to eventually has access to the resource (there is no starvation).In the task distribution case, we would like the task to eventually be performed.
There have been several attempts to consider market mechanisms as a way of efficiently allocating resources in a distributed system.The Contract Net is a high-level communication protocol for a distributed problem solving system.It enables the distribution of the tasks among the nodes that operate in the system.A contract between two nodes is established so that tasks can be executed; each node in the net can either as a manager or as a customer.A task that has been assigned to a node can be further decomposed by the contractor.A contract s established by bidding scheme that includes the announcement of the task by the manager, and bids sent in by the potential contractors.
Enterprise is a system that was built using a variation of the Contract Net Protocol. The Distributed Scheduling Protocol locates the best available machine to perform a task.This protocol is similar to the Contract Net, but makes more well-defined assignment criteria.
Another system that takes economic approach in solving a distributed problem solving through the use of price mechanism has been explored by Wellman's WALRAS system. Wellman uses the consumer/producer metaphor to establish a market pricing-based mechanism for task redistribution that ensures stability and efficiency.All agents act as both consumers and producers.Each distinct good has an auction associated with it, and agents can get the good by submitting bids in the auction for that good.
A cooperative distributed problem solving (CDPS) system is a distributed network of semi-autonomous processing nodes that work together to solve a single problem.Each node is a sophisticated problem solving system that can modify its behaviour as circumstances change and plans its own communication and planning strategies with other nodes.Research under the direction of Professor Victor R.Lesser focusses on how to achieve effective cooperation among agents, thus,balancing several interdependent criteria including efficient usage of processor and communication resources, responsiveness to unexpected situations and real-time deadlines, and reliability.This "effective" cooperation must be able to resolve both the uncertainty and the inconsistencies that can arise among the long term and the short term knowledge held by agents and also to exploit the interdependencies among subproblems being solved by agents.Three conceptual ideas motivate Lesser's approach to coordination strategies:
5.1 Satisficing vs. optimality
The concept of satisficing vs. optimality motivates the search for proper balance within problem
solving systems.It is impossible to achieve coordination strategies that optimise all the
interdependent criteria that go into evaluating the effectiveness of network performance.So,instead,strategies that achieves an acceptable balance between them are used to achieve an acceptable performance level of the network as a whole.
5.2 Tolerance of Inconsistency
Tolerance of inconsistency allows for the problem-solving task to be undertaken in spite
of the state of the knowledge base.An agent's problem solving can be structured so that it is not
necessary for its local knowledge bases to be complete,consistent,and up-to-date in order to make
progress in its problem solving tasks.Agents do best they can with their current information and are organised so that error resolution becomes an integral part of network problem solving.
5.3 Control Intelligence
The interplay between network and local control is an important ingredient in designing
effective coordination strategies.Control intelligence addresses this issue.An empirical aspect of this research is its empirical orientation.Because of infancy of the field,
there are few existent systems from which ideas and intuitions can be drawn to understand what strategies
will be most effective in given situations.Therefore,it is very important for proposed coordination strategies for CDPS to be empirically evaluated.
Research in machine learning has focussed primarily on supervised learning,where a "teacher" provides the learning system with a set of training examples in the form of input-output pairs The goal of this system is to implement an input-output mapping that generalises well to inputs outside the training set,ie:-adaptable to real-life situations.This case we use the basis of Reinforcement Learning (RL) to achieve our goals.However, this proves to be much more difficult in the supervised learning domain since it requires exploration, ie:- finding the best output for any given input.
RL is based upon the idea that the tendency to produce an action should be reinforced if it produces favourable results, and weakened if it produces unfavourable results.Basically,there are two types of RL:-
In non-sequential tasks,the agent must learn a mapping from situations to actions that maximises the expected immediate payoff.Sequential tasks are more difficult because the actions selected by the agent may influence its future situations and thus its future payoffs.So.in this case, agents need to interact in the environment and evaluate its actions on a basis of their long term consequences.
From the perspective of control theory, RL algorithms are techniques for addressing stochastic optimal control problems.
The agent is the controller ,and the environment is the system to be controlled.The objective is to
maximise some performance measure over time.Given a perfect model of the environment,these problems
can be solved in principle using dynamic programming(DP) algorithms.Q-Learning (Watkins,1989) is a
recent RL algorithm that approximates DP incrementally without requiring a model of the environment.
Unlike the traditional DP,Q-Learning can be used to improve performance on-line while the agent
and the environment interact.
The Q-Learning algorithm works by estimating the values of state-action pairs.The valueQ(s,a) is defined to be the expected discounted sum of future payoffs obtained by taking action a from state s and following an optimal policy thereafter.Once these values have been learned,the optimal action from any state is the one with the highest Q-value.We can estimate Q-valactionues on the basis of experience as follows:
small changes in Q(s,a) = x[r+ymaxQ(s',b)-Q(s,a)],
where x is the learning rate and 0<y<1 is the discount factor.
This algorithm is guaranteed to converge to the correct Q-values,provided the environment is stationary ,a lookup table is used to store the Q-values,every state- pair continues to be visited,and the learning rate is decreased appropriately over time.One such example of this algorithm is the Boltzmann distribution.
Multiagent RL has been studied since at least the 1950's,where the work of Tsetlin(1973) on the collective behaviour of learning automata provides an early example. However,in many of these studies,the agents have had independent or rather easy tasks to learn.On the other hand,the theoretical guarantees about Q-Learning do not apply in multiagent settings because the state of the other agents cannot be observed and because the environment is non-stationary due to other agents' learning. Encouraging results were received by Sen(1994);2-agent block pushing experiments,where agents try to make the block follow a line by independently applying forces to it,Bradtke(1993);applying multiagent reinforcement learning to efficiently damp out disturbances of a flexible beam,Littman and Boyan(1994);describes a distributes RL algorithm for packet routing,using a single,centralised Q-function.Success in these area of research ensures us that the goal of "learning self interested automated agents in a distributed environment" is highly in the near future.
The research we have done on automated negotiation has opened many avenues to the way we look at what is possible in the near future.In the discussion we have presented above,topics covering the complete designing conventions on automated negotiation as well as up-to-date advancement in this field.We have also presented the problems involved in making a distributed domain in fully working order.It is highly obvious that the main problems encountered by agents are inter-actions involved when negotiating and the strategies used by different self-interested agents designed by different companies.Hence,that is why we delved further into the domain where "learning" agents exist.The discovery of Q-Learning has therefore prompted more work and research into the aforementioned line of reasoning.Experiments so far has only been amongst two self-interested "learning" agents,for example,Multiagent reinforcement learning in the Iterated Prisoner's Dilemma (IPD) experiment [Tuomas Sandholm and Crites,1995) and Findler's work on how expert systems(learning environments) can control street traffic signals.Only recently has this theory been applied to the multiagent environment.Papers on Mutually Supervised Learning in Multiagent Systems[Claudia V. Goldman and Jeffrey S. Rosenschein,1995} have been produced and hopefully a fully working distributed multiagent learning system will be implemented in the future.
More research need to be done in order for these theories to be a success,where competitive learning issues are the main topics.Hopefully,better learning algorithms would be developed making systems more reliable and efficient.There would be no argument that the technology of automated negotiation would continue to progress in the future.
10.1 Appendix 1 : Academic Papers