Introducing (pause) ... RecipEat!
Having lived in Paris and spent many summers in France, this is often my reaction when my fellow Americans ask me, "What exactly is French food?"
What most of them really mean (I hope) is that French recipes are notoriously complicated. That is, you can't just boil up some pasta and unload a can of tomato sauce (no disrespect to Italian food) and say, "voila!" Quick thought experiment: what came first, the consensus that pasta is easy to make, or the fact that everyone just happens to have pasta and tomato sauce lying around their kitchen? I want to show that French food isn't necessarily more difficult to make; people just need to be inspired to cook with common ingredients they likely already have at home.
"But... aren't there already plenty of recipe generators online?"
Oui. But who wants to manually input each ingredient one by one?
I wanted to keep this tool so simple that even a millennial could use it. With the Snapchat culture in mind, I wanted the user interaction to be as follows:
- Lay out all the ingredients (or leftovers!) you want to cook with on a flat surface.
- Take a single snap through the app.
- Select a recommended recipe that uses all these ingredients.
Admittedly, the recipe recommender portion of the task at hand was already done for me through the Yummly API. I just had to provide it with a list of all the ingredients that were present in the snapped photo.
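Handing the detected ingredients off to the recommender then boils down to a single API call. Here's a minimal sketch, assuming Yummly's v1 search endpoint and its `allowedIngredient[]` parameter (the endpoint, parameter names, and credentials below are illustrative, and the public Yummly API has since been retired):

```python
import requests

def build_query(ingredients, app_id, app_key):
    """Build the query string: one allowedIngredient[] entry per ingredient."""
    params = [("_app_id", app_id), ("_app_key", app_key)]
    params += [("allowedIngredient[]", ing) for ing in ingredients]
    return params

def find_recipes(ingredients, app_id, app_key):
    """Ask the recipe API for matches that use all the given ingredients."""
    resp = requests.get(
        "https://api.yummly.com/v1/api/recipes",
        params=build_query(ingredients, app_id, app_key),
    )
    resp.raise_for_status()
    return resp.json().get("matches", [])
```

The key point is simply that the image model's output (a list of ingredient strings) is exactly the input the recommender expects.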
This naturally called for a multi-label image classification model using convolutional neural networks (CNN).
The most elegant way of solving this problem would be to deploy object detection (locating where different things are within an image) and object recognition (labeling those objects accordingly) with a Region-based CNN (R-CNN). However, this requires an existing dataset of images (see: Microsoft COCO) with various combinations of objects and their bounding box locations. There's also the problem of occlusion, like a rider's leg hidden behind a horse's body.
THE BAD NEWS: There was no dataset with different combinations of common French ingredients or their locations, so a Region-based CNN (R-CNN) was off the table.
THE GOOD NEWS: I didn't care about where the different ingredients, or objects, were in the image—just the labels. Also, the user is supposed to snap the photo with the ingredients spread out on a table, so I didn't have to worry about occlusion either.
As articulated by researchers in the IEEE paper "CNN: Single-label to Multi-label," the jump from a single-label image classification problem to a multi-label problem is not trivial. They note four key strategies for making that jump.
- The CNN must be flexible to be pre-trained with a single-label image dataset like ImageNet.
- The training portion must not require ground-truth bounding box information.
- The model must be robust to possibly noisy and/or redundant hypotheses.
- The model must naturally output multi-label prediction results.
For a quick crash course on CNNs, I highly recommend this Beginner's Guide to Understanding Convolutional Neural Networks. And if that wasn't all-you-can-eat, check out these notes from the Stanford CS231n course CNN for Visual Recognition, which was taught by one of the world leaders in computer vision, Andrej Karpathy.
Strategy 1: The CNN must be flexible to be PRE-trained with a single-label image dataset like ImageNet.
Before I complicate the problem, let's first understand the basic workflow of a single-label image classification problem. You can think of it as analogous to the way the neurons in our brain allow us to learn and then later recognize something. The more and more we are exposed to what we're told is "steak," the more confident we can label something as a piece of steak in the future.
But not all steaks are created equal, and quite frankly, neither are bananas. When we were learning common fruits as babies, someone surely waved a ripe yellow banana in front of us and repeated "ba-na-na." The neurons responsible for learning were registering things like shape, texture, color, and orientation. And even though a more curved, green, tougher-looking banana might have stumped us at one point in time, once we learned that it's still just an (unripe) banana, we expanded our ability to recognize a similar object in the future. Artificial neural networks work in a similar fashion: show a computer a lot of different variations of a banana and it will "learn" how to predict and label this object in the future.
As stated, my model had to be pre-trained with the single-label image dataset ImageNet. This means that when I used Keras, a Python deep learning library, to re-train the top fully-connected layer of my convolutional neural network, I had to do so using single-label images as well. For this, I engineered my own labeled dataset of images by writing a Python script that used Selenium to scrape Google Images. Below are example images of some of the common French ingredients that I wanted my model to be able to recognize. My model can recognize up to thirty ingredients.
A non-exhaustive sample of the images my model saw and then learned during the training phase.
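The re-training step can be sketched in Keras roughly as follows. This is a hedged sketch, not the project's exact architecture: the VGG16 base, the 256-unit dense layer, and the input size are my illustrative assumptions; only the frozen pre-trained base plus a retrained top layer reflect the approach described above.

```python
# Transfer learning sketch: freeze a CNN base pre-trained on ImageNet
# and retrain only the top fully-connected layers on single-label images.
from tensorflow.keras.applications import VGG16
from tensorflow.keras import layers, models

NUM_CLASSES = 30  # the thirty ingredient classes

def build_model(weights="imagenet", input_shape=(224, 224, 3)):
    # Pre-trained convolutional base, without its original classifier head
    base = VGG16(weights=weights, include_top=False, input_shape=input_shape)
    base.trainable = False  # keep the pre-trained filters fixed
    model = models.Sequential([
        base,
        layers.Flatten(),
        layers.Dense(256, activation="relu"),      # retrained top layer
        layers.Dense(NUM_CLASSES, activation="softmax"),  # one label per image
    ])
    model.compile(optimizer="adam",
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```

Training then proceeds with the scraped, single-label ingredient images, exactly as any ordinary single-label classifier would.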
Strategy 2: The training portion must not require ground-truth bounding box information.
Let's just say I'm stubborn and try to feed an image of multiple ingredients through a model that was trained with only single-labeled images. What would happen? Since the image dataset during the training phase didn't have ground-truth bounding box information, the model wouldn't know which regions of pixels represent a distinct object. Instead, it would try to naively aggregate pixels thinking that the image always contained one ingredient. So, there must be more intermediate step(s) before final predictions can be made!
Strategy 3: The model must be robust to possibly noisy and/or redundant hypotheses.
What I proposed was essentially a similar technique to what papers on Regional-CNNs discuss: the concept of "sliding windows." Starting from the top left and ending up at the bottom right, I created a cropping mechanism where each "crop" (or window) represents a portion of the entire image. This cropping mechanism would "slide" from left to right and up and down, ultimately creating a fixed number of new images. But wait, "aren't there going to be a lot of both 'noise' from the crops of blank spaces between ingredients and 'redundancy' because of overlaps?"
We can think of each of these windows as a prediction or "hypothesis" of the ingredient my model is most confident the crop represents. Anticipating that the model would be confused when presented with a crop of the blank space between ingredients, I created a "noise" class with tabletop images. In short, since my model has learned the likes of tabletops, it will output a "noise" label that can then be excluded from the final list of ingredients. As for how I handle redundancy, keep reading!
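The sliding-window cropper itself is simple to sketch. Note the window size and stride below are illustrative assumptions, not the project's actual values:

```python
# Slide a fixed-size square window across the image, left to right and
# top to bottom, collecting overlapping crops to classify individually.
import numpy as np

def sliding_window_crops(image, window=224, stride=112):
    """Return square crops scanning left-to-right, top-to-bottom."""
    h, w = image.shape[:2]
    crops = []
    for top in range(0, h - window + 1, stride):
        for left in range(0, w - window + 1, stride):
            crops.append(image[top:top + window, left:left + window])
    return crops

# A 448x448 snap with window=224 and stride=112 yields a 3x3 grid of crops.
img = np.zeros((448, 448, 3))
print(len(sliding_window_crops(img)))  # 9
```

Because the stride is smaller than the window, neighboring crops overlap; that overlap is exactly the "redundancy" the voting scheme below turns to its advantage.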
Strategy 4: The model must naturally output multi-label prediction results.
To recap: after all the cropping and sliding, these new representations of my original image could finally each go through my trained convolutional neural network. But there was still one slight issue: how could I naturally output multi-labeled results from a bunch of single-label CNN predictions?
In what I'm calling a "voting" scheme, each "hypothesis" or prediction from a cropped window gets the equivalent of one "vote." As for the issue of redundancy, more votes cast for the same ingredient is actually a "good" thing: it means that the ingredient represents a more significant portion of the original image. The fact that the model is making repetitive predictions only improves the overall confidence in the final list of ingredients.
Now, we're just left with guessing how many ingredients are present in the original image. For this, I set a vote threshold: each of the thirty possible ingredients makes the final list only if its vote count meets the threshold. If you're curious to find out more about my voting-threshold algorithm, it can be found on my GitHub!
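The voting-threshold idea can be sketched in a few lines, assuming each crop's top prediction has already been computed. The label names and threshold value here are illustrative; the real algorithm lives on GitHub:

```python
# Tally one vote per crop prediction, drop "noise" (tabletop) votes,
# and keep only ingredients whose vote count meets the threshold.
from collections import Counter

def vote_ingredients(crop_predictions, threshold=2):
    votes = Counter(p for p in crop_predictions if p != "noise")
    return sorted(label for label, n in votes.items() if n >= threshold)

preds = ["tomato", "noise", "tomato", "shallot", "noise",
         "shallot", "butter", "shallot"]
print(vote_ingredients(preds, threshold=2))  # ['shallot', 'tomato']
```

Here "butter" appears only once (likely a stray hypothesis from an ambiguous crop), so it falls below the threshold and is excluded, while the repeated "noise" votes from blank tabletop crops are filtered out before counting.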
Voilà! With the list of ingredients generated from an image recognition algorithm, the recipes using these ingredients can be recommended! For those interested in trying out RecipEat for yourself, the web hosting is coming soon!
The code for my project can be found here. Enjoy!