As for issues, the big one was time. Not coding time, but training time. I had to use a virtual machine with Ubuntu on it, as OpenAI Gym wasn't playing nice with Windows. This massively slowed down training, since I couldn't fully use my computer's processing power. Moving forward, for my final I should be using some AWS servers to train so that I don't have to worry about my poor little computer overheating or having a stroke. With all these explanations for my architectural decisions out of the way, let's get into the project.
The basics of a neural network are as follows: a network takes in a set of inputs (usually a matrix of numbers) and feeds those numbers through one or many layers. Each layer does mathematical operations on the numbers, and at the end of the layers an output is given. Most of this is done through matrix multiplications and dot products that reshape the data being passed between the input, the hidden layers, and the output.
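To make that concrete, here's a rough sketch of what a single layer does (illustrative numpy, not code from my project):

```python
import numpy as np

def layer(x, W):
    """One dense layer: a matrix multiply that reshapes the data, then a nonlinearity."""
    h = np.dot(W, x)   # (output_size x input_size) matrix times the input vector
    h[h < 0] = 0       # ReLU: zero out negatives so the layer isn't purely linear
    return h

x = np.random.randn(4)     # a toy 4-number input
W = np.random.randn(3, 4)  # weights mapping 4 inputs down to 3 outputs
print(layer(x, W))         # a 3-number output
```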
For my project, the input was raw pixel data, since that's what OpenAI Gym returns on each frame. After receiving that input I did some preprocessing on it, downsampling it to reduce computation time (things like halving the resolution by removing every other pixel, grayscaling it to remove color, and erasing the background) and converting it into a workable matrix of values.
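The preprocessing looked something along these lines. The crop boundaries and background pixel values below are the ones commonly used for Gym's Pong, so treat them as assumptions rather than my exact code:

```python
def preprocess(frame):
    """Reduce a 210x160x3 Pong frame to a flat vector of 0s and 1s."""
    frame = frame[35:195]        # crop away the scoreboard and borders
    frame = frame[::2, ::2, 0]   # keep every other pixel, drop the color channels
    frame[frame == 144] = 0      # erase the background (Pong-specific pixel values)
    frame[frame == 109] = 0
    frame[frame != 0] = 1        # paddles and ball become 1, everything else 0
    return frame.astype(np.float64).ravel()  # flatten into the input vector
```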
Once I had my input data, I fed it to the hidden layer. I used a network with a single hidden layer, primarily because it let me cut some corners with the math and reduce computation time. I'm honestly surprised a single layer was deep enough to properly process this information, but if there's anything I learned from this project, it's that the more I think I understand how these algorithms work, the more I'm surprised by the exceptions.
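Put together, the whole forward pass is short. The layer sizes here (200 hidden units on an 80x80 input) are illustrative choices, not necessarily the ones I trained with:

```python
D = 80 * 80   # input size after preprocessing
H = 200       # hidden layer size

W1 = np.random.randn(H, D) / np.sqrt(D)   # input -> hidden weights
W2 = np.random.randn(H) / np.sqrt(H)      # hidden -> output weights

def policy_forward(x):
    h = np.dot(W1, x)                  # hidden layer
    h[h < 0] = 0                       # ReLU
    logit = np.dot(W2, h)              # single output number
    p = 1.0 / (1.0 + np.exp(-logit))   # sigmoid squashes it into a probability
    return p, h                        # hidden state is kept for the backward pass
```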
The hidden layer, once finished with its computations, sends its result as an output. The output I worked with was the probability of moving the paddle up or down. In my initial tests I had it produce the actual value 1 or 0 by just rounding the probability, but instead taking the probability and flipping a weighted coin added an extra layer of randomness which really improved the learning rate. I want to say I understand why it helps...but I'm honestly not sure. I've read in machine learning research papers that this approach is often employed, usually with the explanation that the randomness keeps the model exploring actions it wouldn't otherwise try.
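The coin flip itself is one line. Continuing from the forward-pass sketch above:

```python
x = np.random.randn(D)   # stand-in for a preprocessed frame
p_up, h = policy_forward(x)
# Flip a weighted coin instead of rounding: go up with probability p_up.
action = 2 if np.random.uniform() < p_up else 3   # 2 = up, 3 = down in Gym's Pong
```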
So every frame of the game worked like this: take a screenshot of the game state, process the screenshot data, feed it through the network, get a probability of going up or down, and then have the computer move the paddle up or down depending on how the coin flip landed. However, this is missing the important part of machine learning: the learning. The model grows and learns by being slightly changed depending on how it's doing, so every 50 iterations I performed a technique called backpropagation on the model. Traditionally this involves chaining many compounding partial derivatives, but since I used a single-layer neural network I was able to simplify the calculations with some tricks. Without getting too into the math, backpropagation works backwards through the network, adjusting the weights of each node depending on how the program did. If the network was fed the current game data and produced an action that led to a good outcome, that is rewarded, and the node is changed in a way that encourages making that decision again in a similar circumstance. The opposite is also true: making a decision in a specific state that leads to a bad outcome is penalized. In this way, the network can compile information about how it does and change accordingly to (very slowly) become a pong master!
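Here's a hedged sketch of what that update step can look like for a single hidden layer, continuing from the forward pass above. The batch-array names (epx, eph, ups, probs, rewards), the toy data, and the learning rate are all mine, not the project's:

```python
learning_rate = 1e-3
N = 50   # toy batch of frames

# Toy stand-ins for the data collected while playing a batch of frames:
epx = np.random.randn(N, D)                       # preprocessed inputs per frame
eph = np.maximum(0, epx @ W1.T)                   # hidden states from the forward pass
probs = 1.0 / (1.0 + np.exp(-(eph @ W2)))         # probability of "up" per frame
ups = (np.random.uniform(size=N) < probs).astype(np.float64)  # sampled actions
rewards = np.random.choice([-1.0, 1.0], size=N)   # +1 if we scored, -1 if scored on

def policy_backward(epx, eph, epdlogp):
    """Work backwards through the one hidden layer to get weight gradients."""
    dW2 = np.dot(eph.T, epdlogp)   # hidden -> output gradient
    dh = np.outer(epdlogp, W2)     # push the per-frame error back one layer
    dh[eph <= 0] = 0               # no gradient where the ReLU was off
    dW1 = np.dot(dh.T, epx)        # input -> hidden gradient
    return dW1, dW2

# Scale each frame's "make that action more likely" direction by its outcome,
# so decisions that scored get reinforced and ones that lost get penalized.
epdlogp = (ups - probs) * rewards
dW1, dW2 = policy_backward(epx, eph, epdlogp)
W1 += learning_rate * dW1
W2 += learning_rate * dW2
```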
This was the basic workflow of my program. It just ran over and over, gradually "learning" how to play pong and recording its performance. It did simple bookkeeping like recording stats and saving its current model every 100 iterations, and just kept chugging along and slowly improving. Now, I should mention one thing about the exact definition of an iteration. I actually trained two models: one where an iteration was counted as a point, and another where an iteration was counted as a game. Both saw little improvement in the first few days, but the first model was the one that ended up making massive improvements, while the second stagnated and didn't really improve very quickly. As such, the demonstrations I'll be showing are of the first model, since it ended up much better.
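The checkpointing was nothing fancy, something like this (the filename and function name are illustrative):

```python
import pickle

def checkpoint(iteration, W1, W2):
    """Every 100 iterations, save the weights so a crash doesn't lose days of work."""
    if iteration % 100 == 0:
        with open('model.p', 'wb') as f:
            pickle.dump({'W1': W1, 'W2': W2}, f)
```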
Let's see how it does! (Apologies for the potato quality 🙁 , I had to run this on Ubuntu since OpenAI Gym doesn't like Windows, and I couldn't get a screen recorder working on my virtual machine.)
So here is the model without any training. It initializes completely random weights, so as you can see in the window, the paddle just moves randomly. Sometimes it seems like it's following a pattern and rallying with the opponent, but that's all just variance, and you quickly realize there is no pattern there.
Here's the model after about 2 days of training. It's still very dependent on random numbers and doesn't look like it's made any major improvements over the initial model. At this point I was honestly extremely worried. I made a few changes to the code and started training separate models to hedge my bets in case this one was a total dud that wouldn't improve at all.
Eureka! Here's the model 4 days after that. While it still isn't beating the computer, it has made some massive improvements over the previous checkpoint: it can more reliably return the ball and can actually score points that aren't a fluke!
Here is the model's current state. It has been training for a little more than 5 days now, and while it still hasn't beaten the computer, it's definitely developing some strategies to get there. I'm extremely proud of its progress thus far and will keep it running and training until it can actually beat the computer, and eventually do so consistently.
So let's discuss a little more about what we just saw. The model has learned how to return more consistently, but if you look at its scores, they all come from "smashes". In pong, returning with the edge of the paddle increases the speed of the ball and reverses the angle of its trajectory, while returning with the middle of the paddle gives a slower return with a similar mirrored trajectory. The points the model is scoring all come from these smashes. Here we've witnessed it really "learn" something and start to adopt a strategy, and the reason for this makes a lot of sense.
When we talk about machine learning, we don't yet mean learning in the way that humans do it. Machine learning is based on doing the same thing over and over again until the model figures out the right method. Humans, on the other hand, use far more contextual information and adapt from previous experience: I don't need to walk into a wall 100 times just to find a door in a house I've never been in. Machine learning is about giving a computer a set of constraints, a set of available actions, and a fitness score. The fitness score is a measure of how well the model is doing; as it does more desirable things, its fitness goes up. So the model just crunches numbers to figure out how to maximize the probability of doing something that raises the fitness score in any given state or similar set of states. The only fitness score I implemented was based on the actual game score, and here lies the problem. The model's opponent is REALLY good. The opponent can measure where the ball is going to be and will always move there, provided it can move fast enough (it does have a constraint on movement speed, so it can't just teleport).
When you or I start playing pong, our goal is just to return the ball over and over again. Eventually we might figure out some nuances, but our goal is just to not let the ball get past our side. The model, however, doesn't care as much about not losing as it does about winning. Not losing is more of a side effect, so it keeps doing things until it scores points. The only way to really score against its opponent is to smash the ball and get it moving faster than the opponent can move. Once this sequence of actions happens often enough and is rewarded accordingly, since it scores points, the model falls into this strategy. Eventually, I'm pretty sure the model will just figure out how to get a single return on the initial serve and then smash the next pass so the opponent can't reach it. Now, I could fix this by including a secondary fitness that tracks rally time, giving a better fitness to properly returning the ball (a sketch of what that might look like is below). This would make it seem more like a human playing the game.
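As a hedged sketch, a combined fitness could look something like this. The bonus value and the names are illustrative, not something I've implemented yet:

```python
RALLY_BONUS = 0.1   # small reward per successful return, tuned by hand

def fitness(point_score, rally_returns):
    """Game score plus a small bonus for keeping the rally alive."""
    return point_score + RALLY_BONUS * rally_returns
```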
Overall, I am extremely satisfied with the results of my work. The model still has a while to train before I would call it successful, but watching it gradually grow, and waking up to find it has learned a new trick or strategy, is extremely rewarding. I will be continuing this project and gradually scaling it up into a larger, more experimental work for my final.