Project Leaders: Elijah Grubbs, Kedar Tripathy
Team Members: Jiwon Kim, Seong Taek Kong, Sreya Gogineni, Cameron Burdgick, Aishwarya Arvind, Pascal Rhee, Iaroslav Kovalchuk, Jay Shaver, Benjamin Brown
ABOUT THIS PROJECT
The goal of this project is to find a state-of-the-art machine learning paper, learn about what it takes to implement it, and finally, recreate the findings in the paper. With those goals in mind, we understood going into it that this was a more freeform process than other projects typically follow. Although many of us did not achieve in full all the goals we set for ourselves at the beginning, we all enjoyed the process. And isn’t spending time involved in what you’re passionate about what it’s all really about?
YOLACT: Real-Time Instance Segmentation
Computer vision covers many kinds of tasks. The ones that build up to YOLACT are segmentation and detection. Semantic segmentation seeks to label each pixel according to what it represents. Object detection classifies each object and draws a bounding box around it. Memories of The Terminator or RoboCop probably just ran through your head. But the sci-fi of yesterday is antiquated today thanks to YOLACT. Instance segmentation combines the two and classifies each pixel of each object in the image; but unlike semantic segmentation, multiple appearances of an object are treated as separate entities (instances), hence two different dogs are identified in the far-right image.
YOLACT is also super fast, so fast that it can render segmentations in real-time. The video embedded on the right gives an excellent demo of YOLACT. Both the bounding box and segmentation masks are present, and it also assigns a percentage of confidence to each object. Halfway through the video, it thinks a bunch of rocks are sheep, albeit with only around 20% confidence.
How YOLACT Works
1. The process starts with a fully convolutional neural network that produces a set of image-sized "prototype masks" that do not depend on any one instance. Imagine taking the one picture we originally had to analyze and then creating 5 new ones to help analyze the original.
2. This pooling layer creates the prototypes you see in section 4.
3. and 4. Fed by the pooling layer in section 2, this added Prediction Head outputs, for each detection, a vector of "mask coefficients" that encodes that instance's representation in the prototype space. Look at the graphs for Detection 1 (the person) and Detection 2 (the racket). You can see that Detection 1 has high values for all but the last bar. Now look at the prototypes in section 4: see how the person isn't highlighted in the fourth one? That's what the mask coefficients encode. An important final step is to pass all of the candidate detections from the Prediction Head through Non-Maximum Suppression (NMS). Basically, when several overlapping detections describe the same object, we only care about the most confident one.
5. This step is where we wrap it all up. For each instance that survives NMS, we construct a mask for that instance by linearly combining the work of these two branches: the prototypes are weighted by the instance's mask coefficients and summed. Crop and Threshold are just the final stages of cleaning, getting rid of the junk around the image so all that is left is the segmentations that can be overlaid on the original to create the final result shown on the right.
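To make step 5 concrete, here is a minimal NumPy sketch of the mask assembly as we understand it from the YOLACT paper: a linear combination of prototypes followed by a sigmoid, then crop and threshold. The function name, array layouts, and the 0.5 threshold are our own assumptions for illustration, not the authors' code.

```python
import numpy as np

def assemble_masks(prototypes, coefficients, boxes, threshold=0.5):
    """Combine prototypes and per-instance coefficients into binary masks.

    prototypes:   (H, W, k) array, one image-sized map per prototype
    coefficients: (n, k) array, one coefficient vector per detection kept by NMS
    boxes:        (n, 4) integer array of (x1, y1, x2, y2) pixel coordinates
    (These layouts are assumptions made for this sketch.)
    """
    # Linear combination of the prototypes, followed by a sigmoid.
    logits = prototypes @ coefficients.T            # shape (H, W, n)
    masks = 1.0 / (1.0 + np.exp(-logits))

    # Crop: zero out everything outside each instance's bounding box.
    cropped = np.zeros_like(masks)
    for i, (x1, y1, x2, y2) in enumerate(boxes):
        cropped[y1:y2, x1:x2, i] = masks[y1:y2, x1:x2, i]

    # Threshold: turn the soft masks into binary masks.
    return cropped > threshold
```

A negative coefficient subtracts a prototype from the mask, which is how one instance can "switch off" regions that another instance keeps.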
What We Learned
Through this project, the most important thing we learned was that working on something does not guarantee success. Said differently, when we sink time into something, we expect the payout to equal all the time we put in. Ending this project without a final deliverable makes it tempting to label everything we did as useless. Although our failure was due to physical constraints and not a lack of understanding, we've realized that when you are learning in order to explore your own curiosities, the journey is the destination. Many members went from zero experience in Computer Vision to understanding YOLACT in depth, and the fact that we had neither the time nor the hardware capacity to train our deliverable does not change that.
Where We Started
We had a group interested in finance, and a group interested in reinforcement learning. The paper we found to satisfy us both can be found here. Unfortunately, after looking closer, it was just a library, like pandas. Not much to implement from that. So instead we went through the references and found this paper (the paper is on the GitHub, and the code is there too)! It seemed at first like we had found our paper.
The paper had an introduction to Reinforcement learning, a more complicated algorithm that they implemented (with pseudocode!), and completed code for reference. We thought that with all of that, it would be both fun and rewarding to try and follow the paper. As it turned out, the introduction went from 0 to 100 in 3 pages, throwing out terms and symbols with no definitions or in-text reference. Trying to go through the rest felt fruitless because of our lack of a basic understanding of Reinforcement Learning. Because of this, again, we went into the references to try and see if we could find where to build that foundation.
A Final Switch to... Textbook Implementation?
The very first reference in the second paper was Reinforcement Learning: An Introduction by Sutton and Barto. Little did we know that this book is the most famous and most frequently taught-from resource for Reinforcement Learning. We found a new goal. Read the chapters in the book, and then implement the chapter problems or examples given. We got through 4 chapters of the book, of which only the last two had things to implement. However, the first 2 chapters were invaluable in building our understanding of the Reinforcement Learning problem: its trade-offs, benefits, etc. Below is a run-through of what we learned in the last 2 chapters, and how we chose to show what we know.
On the right, you can see the most fundamental concept that encapsulates the use and power of RL. Given an environment (the stock market, a road, anything really) and an agent (literally YOU, or the thing you code), you can interact with the environment, which changes your state and gives you a reward.
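The loop described above can be sketched in a few lines of Python. Everything here is hypothetical and made up for illustration: `CorridorEnv` is a toy environment we invented (walk along five positions, goal at the far end), and `run_episode` is our own helper name.

```python
class CorridorEnv:
    """Hypothetical toy environment: positions 0..4, with the goal at 4."""

    def __init__(self):
        self.state = 0

    def step(self, action):
        # action is -1 (left) or +1 (right); the position is clamped to [0, 4]
        self.state = max(0, min(4, self.state + action))
        done = self.state == 4
        return self.state, -1, done      # new state, reward, episode finished?


def run_episode(env, policy, max_steps=1000):
    """Let the agent (policy) interact with the environment until it finishes."""
    state, total_reward = env.state, 0
    for _ in range(max_steps):
        action = policy(state)                    # the agent picks an action
        state, reward, done = env.step(action)    # the environment responds
        total_reward += reward
        if done:
            break
    return total_reward
```

For example, `run_episode(CorridorEnv(), lambda state: 1)` walks straight to the goal in four steps and collects a total reward of -4.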
If a problem has a small enough number of states, and of actions to take us between states, we can fit our whole RL process in computer memory. Using Dynamic Programming techniques lets us figure out how best to navigate through the environment. Consider the problem below. If you (the agent) were dropped randomly in this gridworld, what policy (the thing that determines your actions) would you follow such that your total reward is maximized? The environment gives you a -1 reward every time you take a step, so maximizing reward means reaching the end in as few steps as possible.
The "Optimal" Policy
Clearly, the best thing to do would look something like the grid with arrows shown on the right. But how do you get a computer to generate that answer? Since our whole environment fits into memory, and we have a full model of the environment, there are certain strategies we can employ to solve this problem.
In order to keep this light, here is a brief summary of the widely applicable framework called "Generalized Policy Iteration" (GPI). Pick an arbitrary policy (pi) to begin with, then take advantage of the computer you are using to calculate how good it is to be in each square of the grid if you followed that policy. For example, if your policy was to always go to the right, square 14 would have a value of -1, 13 of -2, 12 of -3, and well, you're smart, I'm sure you could fill in the rest.
Once your policy is evaluated, you can start making intelligent decisions: you can CHANGE your policy. Simply pick the best of the available options in each square to build an even better policy!
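Here is a minimal sketch of that evaluate-then-improve loop. It is not our original program. It assumes a 4x4 grid with terminal squares in the top-left and bottom-right corners (as in the Sutton and Barto example), no discounting, and moves off the grid leaving you where you are. It starts from the equiprobable random policy, because an "always go right" policy never terminates from most squares. All function names are our own.

```python
import numpy as np

N = 4                                   # assumed 4x4 grid, squares numbered 0..15
TERMINALS = {0, N * N - 1}              # assumed terminal corners
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]   # up, down, left, right

def step(state, action):
    """Deterministic model of the gridworld: returns (next_state, reward)."""
    if state in TERMINALS:
        return state, 0
    row, col = divmod(state, N)
    r, c = row + action[0], col + action[1]
    if not (0 <= r < N and 0 <= c < N):
        r, c = row, col                 # bumping into the edge: stay put
    return r * N + c, -1

def evaluate(policy, theta=1e-6, max_sweeps=10_000):
    """Policy evaluation: how good is each square if we follow `policy`?"""
    v = np.zeros(N * N)
    for _ in range(max_sweeps):
        delta = 0.0
        for s in range(N * N):
            if s in TERMINALS:
                continue
            new_v = 0.0
            for a, action in enumerate(ACTIONS):
                s2, reward = step(s, action)
                new_v += policy[s, a] * (reward + v[s2])
            delta = max(delta, abs(new_v - v[s]))
            v[s] = new_v
        if delta < theta:
            break
    return v

def improve(v):
    """Policy improvement: in each square, pick the action that looks best."""
    policy = np.zeros((N * N, len(ACTIONS)))
    for s in range(N * N):
        q = [reward + v[s2] for s2, reward in (step(s, a) for a in ACTIONS)]
        policy[s, int(np.argmax(q))] = 1.0
    return policy

def policy_iteration(max_iterations=100):
    policy = np.full((N * N, len(ACTIONS)), 1.0 / len(ACTIONS))
    for _ in range(max_iterations):
        v = evaluate(policy)
        new_policy = improve(v)
        if np.array_equal(new_policy, policy):
            break                       # the policy stopped changing
        policy = new_policy
    return policy, v
```

Note that this sketch stores the policy as a (square, action) table of probabilities, whereas ties between equally good actions are broken by simply taking the first one.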
If you've been paying attention, you might notice that the "optimal policy" shown at the top isn't fully correct. After following this approach, our computer program generated a different policy. Below, you can see our code output on the left. You can read the matrix like this: "What is the probability I move from square [row] on the grid to square [column] on the grid?" As you can verify for yourself, our policy is also optimal.
Monte Carlo Methods
For problems with too many states and actions, you will want a more lightweight way to run generalized policy iteration (GPI) as described above. Monte Carlo Methods work by taking several episodes (the sequences of state, action, reward triples from start to finish) and simply averaging the discounted sum of all rewards received after being in a state and taking a certain action.
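As a rough sketch of that averaging, the function below estimates action values from a list of finished episodes. It uses the first-visit variant (only the first time a state-action pair shows up in an episode counts), which is our choice for this illustration; the function name and the episode format are assumptions as well.

```python
from collections import defaultdict

def mc_action_values(episodes, gamma=1.0):
    """First-visit Monte Carlo estimate of Q(state, action).

    Each episode is a list of (state, action, reward) triples in time order.
    """
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)
    for episode in episodes:
        # Remember the first time step at which each (state, action) occurs.
        first_visit = {}
        for t, (state, action, _) in enumerate(episode):
            first_visit.setdefault((state, action), t)
        # Walk backwards so the discounted return can be built up incrementally.
        g = 0.0
        for t in range(len(episode) - 1, -1, -1):
            state, action, reward = episode[t]
            g = gamma * g + reward
            if first_visit[(state, action)] == t:
                returns_sum[(state, action)] += g
                returns_count[(state, action)] += 1
    return {sa: returns_sum[sa] / returns_count[sa] for sa in returns_sum}
```

With two episodes in which ("A", "right") is followed by returns of -2 and -3, the estimate for that pair is their average, -2.5.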
To ensure we figure out how good it is to be in ANY state, we implemented the "Exploring Starts" algorithm. The only thing it does is give each non-terminal state the chance to be where you begin an episode. Instead of solving the same problem as above, we went off the book and made our own gridworld.
We thought it did not make sense to let you run into walls or the blocks, so we prevented the computer from considering those actions in the random policies it generates at the start.
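A minimal sketch of those two ideas, the exploring start and the restricted actions, might look like the following. The grid size, the blocked cell, and the goal cell are hypothetical placeholders rather than the layout of our actual gridworld, and the helper names are made up for this example.

```python
import random

ROWS, COLS = 3, 4                      # hypothetical grid size
BLOCKS = {(1, 1)}                      # hypothetical blocked cells
TERMINALS = {(0, 3)}                   # hypothetical goal cell
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def allowed_actions(state):
    """Actions that do not run into a wall or a block."""
    row, col = state
    allowed = []
    for name, (dr, dc) in MOVES.items():
        r, c = row + dr, col + dc
        if 0 <= r < ROWS and 0 <= c < COLS and (r, c) not in BLOCKS:
            allowed.append(name)
    return allowed

def exploring_start(rng=random):
    """Pick a random non-terminal state and a random allowed first action."""
    candidates = [(r, c) for r in range(ROWS) for c in range(COLS)
                  if (r, c) not in BLOCKS and (r, c) not in TERMINALS]
    state = rng.choice(candidates)
    action = rng.choice(allowed_actions(state))
    return state, action
```

Each episode would then begin from `exploring_start()` and follow the current policy afterwards, so every state gets a chance to be evaluated.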
We did not have time to generate arrows like the ones provided for the dynamic programming problem, but it would be very easy. You can do it yourself! Just pick any square, and go in the direction whose value is the least negative (but not zero, as zero marks a restricted movement).
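That recipe is short enough to sketch in code. The snippet below assumes the action values live in an array of shape (rows, cols, 4) with the actions ordered up, down, left, right, and with 0 marking a restricted movement; that layout is our assumption for illustration, not necessarily how our program stored them.

```python
import numpy as np

ARROWS = ["^", "v", "<", ">"]          # assumed action order: up, down, left, right

def greedy_arrows(q):
    """Turn a table of action values into one arrow per square.

    q has shape (rows, cols, 4); a value of 0 marks a restricted movement.
    """
    # Restricted movements must never win, so push them to minus infinity.
    masked = np.where(q == 0, -np.inf, q)
    best = masked.argmax(axis=-1)                   # least negative direction
    has_move = np.isfinite(masked).any(axis=-1)     # squares with any legal move
    rows, cols = q.shape[:2]
    return ["".join(ARROWS[best[r, c]] if has_move[r, c] else "."
                    for c in range(cols))
            for r in range(rows)]
```

Squares with no legal move at all (for example a block or the goal) are drawn as a dot.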
Although we only got through 130 pages of the textbook, we really enjoyed the process. We learned a lot about Reinforcement Learning, and have set ourselves up well to pick up and understand additional literature on the topic.
Image background subtraction can be framed as a Semantic Segmentation problem, which is well studied in the field of Computer Vision. The task for a semantic segmentation model that aims to perform image background subtraction would be to predict whether a given pixel in an image belongs to the class ‘person’ or not. Sometimes referred to as ‘Portrait Matting’, this task aims to predict an ‘alpha matte’ that can be used to extract people from an image. The output of the model is a mask where the pixels that have been predicted to belong to the background are colored black, while the pixels that are predicted to belong to the person are colored white.
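To show how such a mask is used once it has been predicted, here is a small NumPy sketch of standard alpha compositing. The function name and the array shapes are assumptions for this example; the matte is taken to be 1.0 (white) for the person and 0.0 (black) for the background.

```python
import numpy as np

def extract_foreground(image, matte, background=None):
    """Blend an image over a new background using a predicted matte.

    image:      (H, W, 3) uint8 array
    matte:      (H, W) float array in [0, 1], where 1 means 'person'
    background: optional (H, W, 3) uint8 array; defaults to black
    """
    alpha = matte[..., None].astype(np.float32)     # (H, W, 1) for broadcasting
    if background is None:
        background = np.zeros_like(image)
    blended = alpha * image + (1.0 - alpha) * background
    return blended.round().astype(np.uint8)
```

With a hard black-and-white mask this simply cuts the person out; with a soft matte, pixels along hair and edges are blended between the two images.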
What We Did
MODNet (Ke et al. 2022):
Architecture of MODNet
We decided that implementing this model would be out of scope for this project. We used this model to understand how to deal with images as inputs and the necessary preprocessing needed as we moved on to a different paper.
Tiramisu (Jegou et al. 2017)
The 12 classes in the CamVid dataset. Left to right: input image, ground-truth (from the dataset), model output
Architecture of Tiramisu
At the end of training, our test accuracy was about 75%. The model can successfully extract the coarse representation of the semantics, so it can clearly predict which general area of the image belongs to the ‘person’ class.
Results of testing the model after training it
Why did we do this project?
The rate at which new machine learning technologies develop poses a scary challenge for anyone in our field. You must take it upon yourself to stay ahead of the curve; no one will do it for you. But it's often daunting trying to teach yourself something new. And it probably doesn't help that STEM papers are full of equations, Greek letters, and lots of other gobbledygook.
By no means have we created a state-of-the-art implementation of our given machine learning methods. Rather, we’ve practiced being life-long learners. After identifying something new, impactful, and interesting, we were able to teach ourselves the fundamentals of the technology and, if we were lucky, implement the higher-level stuff from scratch, on our own! I hope that this project will inspire you to dig deeper into whatever tech has seemed interesting, cool, and cutting edge to you.