Unnamed Mobile App*
Prelude.
I have a confession to make: I have been programming--making stuff--off and on since I was about 10 years old. In all that time, I have never made a mobile app (except for a couple of super basic ones I made in App Lab back in high school). The reasons for this are multifaceted. The first, I will not lie, is a lack of ideas. 99% of the time I get an idea, it is either for a game (I don't really play mobile games) or a desktop application. The other 1% I either sleep on and realize they're bad, or somebody has already made it. Another reason is that, in general, making an app is a nightmare. Tom Scott outlines it beautifully in this video: between privacy concerns, platform compatibility issues, and server costs, operating as a solo mobile app developer seems like a herculean undertaking.
So I'm going to [try to] do it anyway.
For all the entries on this blog thus far, I have planned the project, started the project, and finished the project, all before doing the write-up for the blog. This one is going to be a little different. I have no idea how long this will take, or really what I am getting myself into, so I am going to document my app-making journey in real time. There won't be a schedule or anything for these entries; I fully expect this project to take me several months (or more, depending on how busy I get in the new year).
Step 0: Background.
I have wanted to do something with computer vision for some time now, but I never have come up with an idea I really liked (well there was one, but electrical engineering terrifies me...maybe another day). That was until November 2022, when OpenAI unveiled ChatGPT. And it blew my mind.
For those of you unaware, ChatGPT is a chat-bot implementation of OpenAI's GPT-3 text generation model. In plain English, it is a chat bot that can answer questions, write code, and discuss abstract ideas; I would even argue that it passes the Turing test. Its writing is so good, and so natural, that it's uncanny. It is quite literally the closest thing we have to Tony Stark's J.A.R.V.I.S. right now. It's insane.
Immediately, the gears in my head started turning. The possibilities for this technology are endless. I acknowledge that this may age poorly, but as of this writing I believe it is truly the next step in the evolution of computer technology. We are going to look back at the advent of ChatGPT the way we look back at the advent of the search engine today.
Step 1: The Idea.
Naturally, when presented with this mind-boggling technology, one of the first things I did was try to get it to roast me. This actually proved rather difficult. After Microsoft's debacle a few years back, when the internet quickly turned their attempt at an intelligent chat bot (Tay) racist, OpenAI rightly implemented some precautions in ChatGPT/GPT-3. For starters, the model was trained on a static dataset and does NOT have access to the internet. Secondly, there is a hefty amount of input sanitization going on to prevent the bot from saying anything at all that is morally questionable. This means that if you want the bot to toss you a compliment, it will do it without hesitation, but if you want it to try and knock you down a peg...it needs some convincing. But it can be done. This means game on.
Back in 2018, a YouTuber by the name of Michael Reeves made a morally questionable animatronic Elmo doll. In doing so, he utilized a computer-vision API that took in a picture of someone's face and returned attributes about that person, including their ethnicity. I plan on using this same API in order to learn about the user's appearance.
Every app needs a purpose; the purpose of this app is to help the user control their mood. The lifecycle of the app is as follows: the user takes a picture of their face and selects whether they would like to be roasted or complimented. The app then uses computer vision to make a note of all the person's facial features, and then passes those over to GPT-3 in order to generate a personalized insult or compliment.
The app therefore will consist of the following components (I'm guessing, based on knowledge primarily obtained from FireshipIO videos):
A cross-platform (iOS/Android) client for users to submit their picture and choose whether they would like to be roasted or complimented.
Some sort of cloud backend to handle the computing.
The cloud backend needs to (roughly sketched in code after this list):
Extract the user's face from the image.
Categorize their facial features.
Turn these features into a prompt for GPT-3.
Use GPT-3 to generate a complement or insult.
Deliver the result back to the client.
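In very rough pseudo-Python, I picture the backend looking something like this (every function here is a placeholder for something I have yet to actually build):

```python
# Rough sketch only -- every function below is a placeholder I still need to build.

def extract_face(image_bytes: bytes):
    """Find and crop the face out of the uploaded image."""
    ...

def categorize_features(face):
    """Describe the face: eye color, hair, glasses, and so on."""
    ...

def build_prompt(features, mode: str) -> str:
    """Turn the feature list into a GPT-3 prompt asking for a roast or a compliment."""
    ...

def generate_text(prompt: str) -> str:
    """Send the prompt to GPT-3 and return its response."""
    ...

def handle_request(image_bytes: bytes, mode: str) -> str:
    """mode is either 'roast' or 'compliment'."""
    features = categorize_features(extract_face(image_bytes))
    return generate_text(build_prompt(features, mode))
```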
Now that I know what the app needs to do, it is time for the simple task of learning how to build a cross platform mobile app. Time to go do a bunch of reading, I suppose.
Step 2: Choosing a Framework.
I have been doing a bit of reading, and I've learned that making a mobile app is much, much less straightforward than making something for desktop use. The good news is that there are frameworks you can use to make the experience far less painful. The bad news is that I have come to the conclusion that there are more app frameworks than there are stars in the observable universe, each with its own hip name or acronym. So I have spent the last couple of days learning about the various frameworks and have come up with a handful that could work for this project.
Option A: Xamarin
Xamarin is an iOS and Android app framework made by Microsoft. As a primarily C# developer, I find it a compelling option, as it allows you to make your apps using C# and .NET. However, its relative lack of popularity makes me worry about troubleshooting issues and a lack of third-party plugins/documentation. Also, C# is a language that I am quite comfortable with, and I'm really looking to branch out and diversify my skill set with this project.
Option B: Flutter
Flutter is Google's entry into the cross-platform application field. You know it's good because its name is a verb, and Flutter's popularity reflects this. Flutter apps offer great performance, there's fantastic first-party documentation, and there's a wide selection of plugins/modules. Flutter's programming language of choice is Dart, which strikes me as the illegitimate child of JavaScript and Kotlin, so I would need to learn that. With this one, I even got as far as downloading it and setting it up, but something about it just kept striking me as completely overkill for this project. If I were creating something enterprise-grade or were particularly concerned about performance, I would definitely use it. But for something as (I hope) simple as the front-end app for this project, I believe there are better options.
Option C: React Native
React Native is Meta's answer for cross-platform mobile app development. React Native allows you to create a native mobile application using JavaScript and ReactJS. In essence, you write what looks like a React web app, and it gets rendered with native UI components on the device. The upside to this is ease of development--it's nearly identical to making any other app in React. The downside is performance. JavaScript is known for many things; high performance is not one of them. However, for my use case, which consists of a relatively simple client with most of the processing done on the backend, I'm not super concerned with performance. The general consensus among mobile developers seems to be that between React Native and Flutter, Flutter is the way to go both now and in the future, especially if you're serious. But I just can't seem to find a way to argue against the simplicity of React Native for this particular project. Plus, if I ever want to turn my app into a web app, it will be a trivial process with React Native. For those reasons, I am choosing this framework to build my mobile app in. I might regret it, we'll see. I could always bail and switch to Flutter if React leaves me wanting to tear all of my hair out. Also, this gives me an excuse to learn React properly, which seems to be a good idea in the job market at the moment.
Step 3: Design.
I believe an important part of any sort of software development is to sit down and figure out the user flow of your project. That is, what is the app going to do, and what is the user going to see at any given time? Personally, I find that doing this both gives me an outline for what I need to make and helps to keep the scope of the project in check. So I sat down for a few minutes and created this diagram for my app:
I...umm...never claimed I was an artist. However, my terrible drawing skills have allowed me to visualize that I will need five screens on the frontend, and the backend will need to do exactly two things. Notice that I omitted a login screen. I decided against requiring users to log in for two reasons: first, I think it gets in the way of the user experience; second, I have no intention of storing user data at all. There are several reasons why I do not wish to store user data, but it boils down to servers and privacy. I am a broke college student. Servers are expensive. The fewer servers I need for this project, the better.
Secondly, there is the issue of data privacy. This is an app that users will be uploading (hopefully) pictures of their faces to. To a lot of people, that is personal. I have no intention of selling any data generated to advertisers, and I feel like holding on to the images will just open me up to a lawsuit. So any images uploaded to the app will be destroyed immediately after analysis.
Now, with the design of my app complete, I guess it's time to start coding? But first...
The App Needs a Name.
Since this entire project was spawned by my fascination with ChatGPT, I figured: what better way to name the thing than to ask ChatGPT itself? So I asked ChatGPT to name my app.
Going to be honest, most of those suggestions are terrible. "Face Finder" got a chuckle out of me because in my heart I am 12 years old. But there's something about the way "Mirror, Mirror" rolls off the tongue that I just kinda like. A quick search on the App Store and Play Store reveals that there aren't really any apps with that exact name, so I think I'm going to go with it, at least for the time being.
Setting up my Development Environment.
In the last entry, I talked about my experience in picking a mobile development framework, and I said that I had settled on React Native. Shortly after I wrote that, I embarked on setting up my development environment for React Native. This is where I began running into issues. I know from prior experience that developing JavaScript applications on Windows is just sort of...annoying? I think it boils down to Windows CMD/PowerShell not being the industry-standard tools, and therefore needing to translate commands on the fly from Bash on a semi-regular basis. Thus, I opted to utilize VSCode's remote development feature in order to develop everything natively in Ubuntu via WSL. In my own experience, particularly with WSL 2, using it this way to essentially replace the Windows command line with Bash has been a relatively seamless experience. That changed when trying to get a React Native project up and running. I found myself running into countless errors while following the guides on React Native's website, to the point where, several hours and two dozen Chrome tabs in, with no end in sight, I decided screw it, let's give Flutter a shot. Less than half an hour later I had Flutter fully set up and working, and had even managed to build and deploy the example app to my phone. Furthermore, I find Flutter's language of choice, Dart, to be more enjoyable to write than JavaScript, so I think I'll stick with Flutter for the rest of the project.
I think my favorite part about Flutter at the moment is how braindead simple it is to test your app. It's literally a menu where you choose what you want to display it on, whether it be your desktop, a phone emulator, or even your actual phone. No hassle, no real setup (you just toggle one option in your phone's settings) and it just works. It's brilliant.
Additionally, I'm amazed at how fluid rapid prototyping is within Flutter. I followed a handful of tutorials in order to get a feel for the framework, and everything, in particular the external package manager, is just super easy to use and relatively hassle-free. I think the best example of this was when I was accessing the camera on my device. Going into the project, this is something I assumed would be a nightmare of figuring out what device we were on, figuring out how to use the camera driver, etc. It turns out that in Flutter, this is taken care of with one import and half a dozen lines of code. Super convenient.
That's the update for today, I'm going to get back to learning how to lay out a UI in Flutter.
Laying out the UI.
Flutter's UI system is composed entirely of widgets, and I am really surprised at how quick and painless it is to lay out a basic UI. Obviously, all the images I used are placeholders, but I still don't think that the main menu prototype I whipped up is half bad. Additionally, I implemented a splash/loading screen in literally two minutes. Prototyping is helped dramatically by Flutter's hot-reload functionality, which allows me to make changes to the code and then simply press a button to see them reflected on the device, without the need to recompile/reinstall. As for the device...I am sure you can tell by the screenshot that I am testing this on Android. Flutter is cross-platform, and everything I've read seems to imply that it should work fine on iOS, but I don't have access to an iOS device to test on. For a while, I was testing using an emulator, but my laptop was starting to struggle running both the emulator and VSCode. For that reason, I have appropriated my old Essential PH-1 as a dedicated testing device. I don't remember if I ever unlocked the bootloader on this phone, but getting it set up was as simple as enabling USB debugging. I must say I really enjoy having the dedicated testing device, as it both frees up screen real estate on my laptop, and it's also pretty cool to actually be able to hold the app you made in your hand.
Writing UI code in Flutter feels to me like what HTML/CSS should be. Or maybe I just haven't had enough time with Flutter to resent it yet. I think the reason for this is the Dart language. It strikes an excellent balance of being verbose enough to be relatively easily read, but doesn't go full Java with "public static void main (String[] args)". Given Flutter's cross-platform nature, I think that next time I write any UI application I will probably do it with Flutter. The combination of hot-reload and easily accessible premade assets is borderline unbeatable.
Starting the Backend.
I've got the UI for the app mostly functional now, which means that it is time to begin work on what gives this app its functionality: the backend. The backend of the app will perform the following: identify the face in the image, determine notable characteristics about the face, then use those characteristics as input to GPT-3 to roast or compliment the user.
To write the backend, I am going to be using Python, primarily for its easy-to-write, elegant style that I have really grown fond of lately. This is going to result in a performance hit compared to a language like C++; however, given the expected scale of this app, I do not foresee it being a major issue.
Facial Detection.
In order to perform all of the image recognition for this app, I am going to be using a library called OpenCV. OpenCV is an open-source computer vision library that is useful for doing loads of different stuff. We'll start by setting up OpenCV and using it to get some basic facial detection up and running. The first step is installing OpenCV. Python has an easy-to-use package manager called pip, so we can install OpenCV like this:
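```
pip install opencv-python
```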
With OpenCV now installed, all we need to do is import it into our program as such:
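```python
# OpenCV's Python bindings are imported as cv2
import cv2
```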
OpenCV uses a system called Cascade Classifiers in order to perform image recognition. Essentially, a Cascade Classifier is a trained algorithm that tells OpenCV how to recognize something. You can read more about it here if you want. In practice, it's a roughly 33,000-line XML file that my puny human brain has no hope of understanding. Fortunately, OpenCV provides a bunch of ready-made Cascade Classifiers that you can download and use. So, after downloading the one for faces, we set it up as such:
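```python
# Load the pre-trained face classifier (downloaded from the OpenCV repo
# and saved next to this script).
face_cascade = cv2.CascadeClassifier("haarcascade_frontalface_default.xml")

# Optional sanity check (my addition): empty() is True if the file failed to load.
if face_cascade.empty():
    raise IOError("Could not load haarcascade_frontalface_default.xml")
```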
Here is where I ran into an issue: for some reason, Python refused to open the cascade file, citing a permission error: [ERROR:0@0.028] global persistence.cpp:505 cv::FileStorage::Impl::open Can't open file: 'haarcascade_frontalface_default.xml' in read mode. This error was tricky to pin down because, first, there weren't any ten-year-old forum posts about it, and second, it is a LIE. Well, sort of. For my development environment, I use VSCode. I had the main project folder open, and my Python code was nested in a subfolder (the same subfolder that contained the cascade file, so that was not the issue). To test my program, I was using the 'Run Program' hotkey in VSCode. The real issue was the working directory: my program was being run from the main project folder rather than from the subfolder containing the cascade file, so the relative path to the cascade never resolved. There are two easy remedies for this: the first is to simply open the subfolder in VSCode instead of the main folder. This will allow the run button to function as intended, but it may prove to be a hassle if you need to work on files outside of the subfolder. The second is to navigate to the subfolder in the terminal (e.g. cd ./lib) and then run your program from the command line (e.g. python myprogram.py). Both of these solutions resolve the error, and I figure I may as well preserve them here for posterity. You're welcome, weary traveler 10 years from now.
With that minor annoyance taken care of, it's time to get an image to detect a face in. In the final app, we will be working with an image sent over from the user's device, but for the time being we will get images from the webcam on my laptop. A video is nothing more than a series of stills, so we will grab frames from the camera like so:
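```python
# Open the default webcam (device 0) and grab frames in a loop.
capture = cv2.VideoCapture(0)

while True:
    ret, frame = capture.read()   # ret is False if a frame couldn't be read
    if not ret:
        break
    # ...everything that follows happens inside this loop...
```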
We now have an image from the camera, and it's time to analyze it for faces. There is just one small problem: the image we have is in color, and color creates a lot of overhead for computer vision. To remedy this and improve performance, we will convert the image to grayscale:
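```python
# (inside the capture loop) Drop the color information -- the cascade only
# cares about intensity, not hue.
gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
```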
With our new grayscale image, it's time to finally perform facial detection. For this, we will use detectMultiScale(). detectMultiScale() takes three arguments: the image to process, the scale factor, and the number of neighbors. For the image to process, we will pass it our grayscale image. The scale factor specifies how much to shrink the image by at each scale; all you need to know is that the higher this number, the faster the detection will run, albeit with lower accuracy. The scale factor must be greater than 1.0.
The number of neighbors is how many neighbors each possible detection needs in order to not get thrown out. Higher numbers reduce the chance of a false positive, but can lead to more false negatives.
After a bit of playing with the values, here's roughly what I came up with (your mileage will vary with your camera and lighting):
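```python
# (inside the capture loop) Illustrative values: a scale factor of 1.3 and
# 5 neighbors are common starting points, not magic numbers.
faces = face_cascade.detectMultiScale(gray, scaleFactor=1.3, minNeighbors=5)
```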
And in terms of facial detection, that's literally it. The result now contains the (x, y) coordinates of each detected face, as well as its width and height. To help us visualize it, we'll go ahead and display some text regarding detection status, as well as draw a rectangle around the face, before displaying it in the output window:
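```python
# (inside the capture loop) Report the detection status and box each face.
# The text position and the green color are arbitrary choices.
status = "Face detected" if len(faces) > 0 else "No face detected"
cv2.putText(frame, status, (10, 30), cv2.FONT_HERSHEY_SIMPLEX, 1, (0, 255, 0), 2)

for (x, y, w, h) in faces:
    cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)

cv2.imshow("Face Detection", frame)
```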
All that's left is to listen for a keypress to terminate the loop (I picked the escape key), and clean up after ourselves:
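```python
# Still inside the loop: the escape key has key code 27.
if cv2.waitKey(1) & 0xFF == 27:
    break

# Back outside the loop once it ends: release the camera and close the window.
capture.release()
cv2.destroyAllWindows()
```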
Success.
We now have fully working facial detection! The next step is to identify the locations of individual facial features, and gather information about them. I'll get started on that now I suppose.
Detecting Eye Color.
The first facial feature I opted to try to detect is the color of the person's eyes. This meant first detecting where the person's eyes were. Fortunately, there is a ready-made Haar cascade for detecting eyes, so we import and use it the same way we used the face cascade last time:
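```python
# Same pattern as before, just with the stock eye classifier.
eye_cascade = cv2.CascadeClassifier("haarcascade_eye.xml")

for (x, y, w, h) in faces:
    # Searching only inside the detected face rectangle is optional,
    # but it cuts down on stray hits elsewhere in the frame.
    face_gray = gray[y:y + h, x:x + w]
    eyes = eye_cascade.detectMultiScale(face_gray, scaleFactor=1.3, minNeighbors=5)
```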
After testing with this method, I was shocked to learn that I in fact have three eyes. Or at least it thinks I do. It turns out that the default Haar cascade for eyes isn't very good, but fortunately there is a much better one, intuitively named "haarcascade_eye_tree_eyeglasses.xml". Switching to this cascade rectified the issue.
Now that I knew where the person's eyes were in the image, the next step was determining their color. I first went with the straightforward approach of "pick a point near the center of the eye and get its color," as such:
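```python
# (inside the face loop from before) First, flawed attempt: sample a single
# pixel near the middle of each eye box. Eye coordinates are relative to the
# face crop, so offset them by (x, y).
for (ex, ey, ew, eh) in eyes:
    center_x = x + ex + ew // 2
    center_y = y + ey + eh // 2
    b, g, r = frame[center_y, center_x]   # OpenCV stores pixels as BGR
    print("Sampled eye color (R, G, B):", (int(r), int(g), int(b)))
```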
This didn't work. At least not well. The issue was not necessarily with the code, but with the lighting in the image. It turns out eyes are reflective, and more often than not my code would end up sampling the glint in the eye. Additionally, a person's iris, in what I consider to be a design flaw, is not uniform in color. Therefore, even if my code happened not to sample the glint, and also happened not to sample the white of the eye (which would occur if the person was not looking at the camera), the color it got could still be unrepresentative of the eye color as a whole. I needed a new approach.
Instead of simply sampling a point on the eye, we can cut out the eye region and analyze the colors across the whole crop with a histogram. The peak of the histogram would then (theoretically) correspond to the color of the person's iris. We can do this as such:
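```python
# (inside the eye loop) One way to slice it: histogram just the hue channel
# of the HSV crop and take the peak as the iris color.
import numpy as np  # numpy comes along with opencv-python

eye_bgr = frame[y + ey:y + ey + eh, x + ex:x + ex + ew]
eye_hsv = cv2.cvtColor(eye_bgr, cv2.COLOR_BGR2HSV)

# OpenCV hues run 0-179, so use 180 bins and pick the tallest one.
hue_hist = cv2.calcHist([eye_hsv], [0], None, [180], [0, 180])
dominant_hue = int(np.argmax(hue_hist))

# Round out the HSV triple with the crop's average saturation and value.
dominant_sat = int(eye_hsv[:, :, 1].mean())
dominant_val = int(eye_hsv[:, :, 2].mean())
print("Dominant eye color (HSV):", (dominant_hue, dominant_sat, dominant_val))
```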
And this worked! At the very least, the color it output was close enough to my actual eye color for me to be happy with it. There was just one small problem: this just outputs the HSV value of the person's eye color. In English, that is meaningless. Nobody has HSV(247, 100%, 100%) eyes. They have blue eyes. Which means we need to figure out how to give the color a name. Fortunately, we have industry standards. The HTML standard defines over one hundred named colors, so we are going to take the iris color and match it to its closest counterpart on the HTML list. Computers typically store color as red, green, and blue values; the color of this text, for example, would be (180, 126, 91). We can treat these RGB values as coordinates in a 3D space, and then use the Pythagorean theorem to determine the distance to each known color value, keeping whichever is closest.
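Here's a rough sketch of that matching step, assuming the dominant color has already been converted from HSV to RGB (and with only a handful of the ~140 HTML color names filled in):

```python
import math

# A small sample of the HTML named colors as (R, G, B) tuples;
# the real list has roughly 140 entries.
HTML_COLORS = {
    "black":          (0, 0, 0),
    "gray":           (128, 128, 128),
    "saddlebrown":    (139, 69, 19),
    "peru":           (205, 133, 63),
    "steelblue":      (70, 130, 180),
    "darkolivegreen": (85, 107, 47),
    "blue":           (0, 0, 255),
    "green":          (0, 128, 0),
}

def closest_html_color(rgb):
    """Return the HTML color name with the smallest Euclidean
    (Pythagorean) distance to the given (R, G, B) value."""
    return min(HTML_COLORS, key=lambda name: math.dist(rgb, HTML_COLORS[name]))

print(closest_html_color((180, 126, 91)))   # -> 'peru' with this sample palette
```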