As part of my M.S. in Computer Science at UCLA, I took a class called Animats, in which we studied artificial software agents that can learn from, adapt to, and interact with their environments. The main concept in this course was how what appears to be higher-order intelligence can emerge from a much simpler underlying level of intelligence.
As part of this course, we were tasked with proposing and implementing an artificial agent that could demonstrate this concept. This blog post gives a high-level overview of my project, approach, and results.
For this project, I wanted to see if I could induce colony-like behavior from self-interested individual artificial agents, without the use of communication between agents.
What exactly does this jargon mean? In more concrete terms, I wanted ants (in software) that have no way of communicating with each other to appear as if they are communicating by forming colonies. A colony, in the context of this project, is defined as a centralized location to which some ants bring resources and where other ants "hang out to eat".
To simplify, our project made some underlying assumptions:
The environment (2D plane) where the ants reside consists of 4 food types (e.g. sugars, proteins, vitamins, and water) that an ant will need to consume to survive. There are food generators and food objects. Food generators generate food of a given type at set intervals.
Ants have gradient sensors that tell them in which direction (North, South, East, or West) from their current position each food type is located. However, ants will only "target" the food type which is currently the most depleted. For example, if the ant is low on water, it will ignore seeking out sugars, proteins and vitamins and only head towards a water source in the environment.
Ants have the ability to pick up food, move around with food in their jaws, and drop food back into the environment. They can also bite into food to eat. Eating food replenishes the ant's energy (positive reward). However, if an ant drops a piece of one food type onto a piece of another food type (e.g. water onto sugar) and eats the two together, it receives bonus energy that exceeds the sum of the energy it would have gained by eating each food type individually. Therefore, there is a strong incentive for the ant to collect all the food in one location before biting down to eat it.
Moving around the environment and picking up and dropping food costs energy (a negative reward, or penalty).
Each ant in the environment is independent of the others (i.e. it has its own brain and its own self-interests, and is not part of any predefined colony). Each has its own internal states and actions (more on this later).
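To make the assumptions above a bit more concrete, here is a minimal Python sketch of the targeting rule and the reward structure. It is not the project's actual code: the `Food` enum, the function names, and every numeric constant are placeholders I'm assuming for illustration. The only properties taken from the description are that an ant targets its most depleted food type (excluding one it is already holding), that eating several food types together pays more than eating them separately, and that moving costs energy.

```python
from enum import Enum


class Food(Enum):
    SUGAR = 0
    PROTEIN = 1
    VITAMIN = 2
    WATER = 3


# Hypothetical constants, chosen only so the example runs.
ENERGY_PER_FOOD = 10.0  # energy gained from eating one food type alone
COMBO_BONUS = 5.0       # extra energy per additional type eaten together
MOVE_COST = 1.0         # energy spent per move, pickup, or drop


def target_food(levels: dict[Food, float], holding: frozenset = frozenset()) -> Food:
    """Return the most depleted food type, skipping any type currently held."""
    candidates = [food for food in levels if food not in holding]
    return min(candidates, key=lambda food: levels[food])


def eat_reward(foods_here: set) -> float:
    """Reward for eating every food type stacked at the ant's position."""
    n = len(foods_here)
    if n == 0:
        return 0.0
    # The bonus makes eating n types together worth more than n separate meals.
    return n * ENERGY_PER_FOOD + (n - 1) * COMBO_BONUS


def action_cost() -> float:
    """Negative reward for moving, picking up, or dropping food."""
    return -MOVE_COST
```

With these placeholder numbers, eating sugar and water together yields 25 energy, versus 20 for eating them one at a time, which is the incentive that makes a central stockpile attractive.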
The goal is to see if we can construct a system in which the ants will learn that, in order to maximize their individual energy, it is most beneficial to have some ants pick up the 4 food types from the environment and bring them to one central location, where they can consistently consume multiple food types at once. As a byproduct, this will also increase the net energy stored in all the ants collectively, since the majority of ants will no longer have to spend energy scavenging the environment for food.
Also, remember, we want to do all this without the ants explicitly communicating with each other. The underlying intelligence should be rather minimal.
So how do we accomplish this?
The Approach: Q-Learning
We modeled each individual ant as a Q-Learning agent. A Q-Learner learns by exploring the possible actions it can take in any given state. Each time an action leads to a reward, the Q-value of the corresponding state-action pair is updated, and these updates gradually propagate back to the earlier state-action pairs that led there. Over time, the agent learns the optimal policy (the best action to take in each state) in order to maximize its reward. For more information, check out this video on reinforcement learning.
For our ants, we have a finite number of states and actions. At any given time, the internal state of the ant is represented by a bitmap encoding the following:
1. Target Food - the food type that the ant is lowest on. An ant cannot target a food type if it is currently holding a food of that type (2 bits to represent the 4 food types).
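For readers who have not seen it before, the standard tabular Q-Learning update can be sketched in a few lines of Python. This is a generic illustration rather than the project's code; the learning rate `ALPHA`, the discount factor `GAMMA`, and the dictionary-based table are assumptions I'm making for the example.

```python
from collections import defaultdict

ALPHA = 0.1  # learning rate (assumed value)
GAMMA = 0.9  # discount factor (assumed value)


def q_update(q, state, action, reward, next_state, actions):
    """Apply one Q-Learning update for the transition (state, action) -> next_state."""
    # Value of the best action available in the next state.
    best_next = max(q[(next_state, a)] for a in actions)
    # Move the old estimate a fraction ALPHA toward the new target.
    td_target = reward + GAMMA * best_next
    q[(state, action)] += ALPHA * (td_target - q[(state, action)])


# Q-values default to 0.0 for state-action pairs that were never visited.
q = defaultdict(float)
q_update(q, state=0, action="eat", reward=10.0, next_state=1, actions=["eat", "move"])
```

After this single update, the Q-value for eating in state 0 has moved one tenth of the way from 0 toward the observed reward of 10.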
2. Holding Food - whether or not an ant is holding a food item of a certain type (4 bits, one for each food type).
3. On Food - whether or not an ant is currently on top of a food of a certain type in the environment (4 bits, one for each food type).
4. Gradient - for each food type, the direction (North, South, East, West) that leads to the closest food object in the environment (8 bits, 2 for each food type).
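As an illustration, the four fields above can be packed into a single 18-bit integer that serves as a Q-table key. The sketch below is one possible layout, not necessarily the one the project used: the field order, the function name `encode_state`, and the numeric direction codes (0 to 3) are all assumptions.

```python
def encode_state(target, holding, on_food, gradient):
    """Pack the ant's internal state into an 18-bit integer.

    target:   int in 0..3, index of the targeted food type (2 bits)
    holding:  4 booleans, one per food type (4 bits)
    on_food:  4 booleans, one per food type (4 bits)
    gradient: 4 ints in 0..3, a direction code per food type (8 bits)
    """
    state = target & 0b11
    for flag in holding:
        state = (state << 1) | int(bool(flag))
    for flag in on_food:
        state = (state << 1) | int(bool(flag))
    for direction in gradient:
        state = (state << 2) | (direction & 0b11)
    return state


# Example: targeting food type 3, holding food type 0, standing on nothing,
# with an arbitrary direction code for each food type.
example = encode_state(
    target=3,
    holding=[True, False, False, False],
    on_food=[False, False, False, False],
    gradient=[0, 1, 2, 3],
)
```

Every encoded value falls between 0 and 2^18 - 1, so the integer can be used directly as a dictionary key or array index.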
In addition to states, we have actions. An ant is able to perform any of the following actions:
- Move North
- Move South
- Move East
- Move West
- Eat Food
- Pickup Food (only if there is no food currently picked up)
- Drop Food (only if there is food currently picked up)
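The action list and its two constraints might be expressed as follows. This is a sketch with assumed names (`Action`, `valid_actions`); it treats the four moves and eating as always available, since the only restrictions described here are on picking up and dropping.

```python
from enum import Enum, auto


class Action(Enum):
    MOVE_NORTH = auto()
    MOVE_SOUTH = auto()
    MOVE_EAST = auto()
    MOVE_WEST = auto()
    EAT = auto()
    PICKUP = auto()
    DROP = auto()


def valid_actions(holding_food: bool) -> list:
    """Return the actions an ant may take, given whether it is carrying food."""
    actions = [
        Action.MOVE_NORTH,
        Action.MOVE_SOUTH,
        Action.MOVE_EAST,
        Action.MOVE_WEST,
        Action.EAT,
    ]
    # Pickup requires empty jaws; drop requires carrying something.
    actions.append(Action.DROP if holding_food else Action.PICKUP)
    return actions
```

Under these rules an ant always has six of the seven actions available at any moment.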
Therefore, the state space consists of 2^18 unique states (2 + 4 + 4 + 8 = 18 bits), while the action space consists of 7 unique actions. Even though this space is relatively large to explore, Q-Learning is a good candidate for solving this problem since it can learn the optimal policy even while the agent is following a different, exploratory policy (this is called off-policy learning).
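To put the size in perspective, a full table would hold 2^18 x 7 = 1,835,008 Q-values per ant. One common way to handle this, sketched below, is to store only the state-action pairs an ant has actually visited and to pick actions with an epsilon-greedy rule. I'm assuming epsilon-greedy here purely for illustration; the exploration strategy and the name `choose_action` are not taken from the project.

```python
import random

NUM_STATES = 2 ** 18
NUM_ACTIONS = 7
FULL_TABLE_SIZE = NUM_STATES * NUM_ACTIONS  # 1,835,008 entries


def choose_action(q, state, actions, epsilon, rng=random):
    """Epsilon-greedy behavior policy over a sparse Q-table.

    q maps (state, action) to a value; missing entries count as 0.0.
    """
    if rng.random() < epsilon:
        # Explore: pick any available action uniformly at random.
        return rng.choice(actions)
    # Exploit: pick the action with the highest current estimate.
    return max(actions, key=lambda a: q.get((state, a), 0.0))


q = {(5, "eat"): 2.0, (5, "move"): 1.0}
greedy = choose_action(q, 5, ["move", "eat"], epsilon=0.0)  # always exploits
```

Because the Q-Learning update bootstraps from the best next action rather than the action the ant actually takes next, the random exploratory moves made by this behavior policy do not prevent the table from converging toward the greedy policy's values.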
Did It Work?
For the project, we programmed six different simulation scenarios to test our hypothesis, but I'll only talk about the two most interesting results.
Scenario I: 4 Distributed Food Sources
For the first scenario, we placed the 4 food types on opposite sides of the 2D environment and let the ants run through their simulation. You can see a visualization of this in the video above. Over time, a few ants pick up food from the respective food sources and bring it to the center of the environment, where we then see a colony of ants start to form (multiple ants and food sources occupying the same small region of space). What's interesting to note here is that not every ant needed to learn the optimal policy of picking up food and bringing it to a centralized location. As long as a few ants learned this altruistic policy, the other ants could "freeload" off of the colony's centralized resources without needing to learn the policy themselves.
Scenario II: 4 Distributed Food Sources -- Unlimited Food
In this second scenario, we placed only 1 instance of each food type on opposite sides of the environment. However, these food objects are infinitely large (i.e. they cannot be totally consumed by the ants) and thus can be moved around in their entirety. From the animation, we see that the ants initially start by visiting each food source on the map individually and eating. Over time, a few ants once again learn to pick up and drop the food in a centralized location to minimize the energy they spend traversing the environment. What's unique in this scenario is that the colony itself moves around the environment. This may be due to a few ants that are still exploring their state-action space; however, it is interesting to note that as soon as one food type is moved away from the colony by a "rogue" ant, other "better trained" ants quickly move the food type back to the colony area or move the other three food types closer to the displaced food type in order to maintain the colony.
Overall, it seems from these initial experiments that the simple Q-Learning approach to colony formation works with self-interested agents, given the right conditions. Even though none of the ants in the simulations were able to communicate with each other, the ants formed colonies near the center of the distributed food locations.