Blog: SpotyGuard
SpotyGuard: 2023-24
My university has an Intelligent Robotics lab, and they are kind enough to allow undergraduates to use it to perform research with the various robots it houses, among them Spot from Boston Dynamics. For my senior research project, a small team and I decided to use Spot to explore the potential use cases for such a robot, and to better understand how people might interact with these robots in their daily lives.
Abstract.
Boston Dynamics’ Spot robot is a general-purpose quadruped robot intended to inspect, explore, and gather data about a wide range of environments and terrain, particularly those that have the potential to be hazardous to humans. The sensors Spot is equipped with give the robot a 360° perception of its environment, enabling an unprecedented awareness of its surroundings. In a modern world with modern threats, physical security has become more important than ever, which is why this research aims to examine and test Spot’s ability to patrol a space and produce an audio alert if it detects any unauthorized personnel in that space. In other words, we intend to teach the $100,000 robot dog how to bark at strangers.
My Part.
For my part in this project, I was responsible for writing the code that allowed Spot to register, identify, and recognize individuals by their faces.
The first challenge was obtaining usable data for facial recognition. As mentioned earlier, Spot does have cameras and sensors on all sides of its body, giving it a 360° perception of its environment. However, the primary purpose of these cameras and sensors is obstacle avoidance, so they are low resolution and thus ill-suited to a task such as facial recognition. Fortunately, there is one camera on Spot that fit the bill: the gripper camera. Spot's gripper houses a 4K color camera, ideal for what we were attempting to accomplish. There was just one slight problem: accessing it. Although the documentation for Spot is good as a whole, the documentation for retrieving data from the gripper camera was non-existent. Thankfully, I found a workaround by asking Spot to list all attached devices and sensors, and then searching that list for the correct camera. It turns out that although all of the documentation refers to the thing on the end of Spot's arm as a gripper, the camera id is "HAND_COLOR_IMAGE".
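For anyone curious, the workaround amounted to something like the sketch below, using the bosdyn-client SDK. The robot IP and credentials are placeholders, and the exact source name and casing may differ between SDK versions, so treat this as an illustration of the approach rather than the project's actual code.

    import bosdyn.client
    from bosdyn.client.image import ImageClient

    # Placeholder hostname and credentials for connecting to Spot.
    sdk = bosdyn.client.create_standard_sdk('SpotyGuardImageClient')
    robot = sdk.create_robot('192.168.80.3')
    robot.authenticate('username', 'password')

    image_client = robot.ensure_client(ImageClient.default_service_name)

    # Listing every attached image source is how we found the gripper camera.
    sources = image_client.list_image_sources()
    print([source.name for source in sources])

    # Search the listing for the gripper ("hand") color camera rather than
    # hard-coding the id, since the name is not obvious from the docs.
    hand_sources = [s.name for s in sources if 'hand_color' in s.name.lower()]

    # Pull one frame from the 4K gripper camera; the payload is raw image
    # bytes (typically JPEG) ready to be decoded.
    response = image_client.get_image_from_sources(hand_sources)[0]
    raw_bytes = response.shot.image.data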
With data acquisition out of the way, it was time to work on facial recognition. A while ago I did a write-up on how to perform facial detection in Python using OpenCV, so my first thought was to use cascade classifiers for each individual facial feature to build a "model", so to speak, of an individual's face. However, I quickly realized that this was something Haar cascades were simply not suited to. After conferring with my group, I looked into an approach involving one-shot learning. The idea was that Spot would store the faces of all known individuals on board, and whenever a person was encountered during operation, it would take a picture of their face and compare it to the known individuals using a one-shot learning model. If no match was found, Spot would bark. Although this idea was sound on paper, and I got a prototype working on my laptop, it ended up having several issues that ultimately led me to scrap it. The first problem was version incompatibility. The prototype ran fine on my laptop, but Spot runs an older version of Ubuntu and Python, and I found myself wrestling with version mismatches more than actually making progress on the project. The second problem was that this approach ended up being slower than expected: between the O(n) nature of the matching process and the fact that the model was quite computationally intensive, performance was simply poor.
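To illustrate why the matching step scaled poorly: the prototype essentially compared the new face against every stored face one pair at a time. The sketch below is a hypothetical reconstruction of that loop, not the actual code; the model's similarity call and the threshold value are stand-ins.

    def find_match(model, query_face, known_faces, threshold=0.8):
        """Compare a freshly captured face against every known face.

        One model comparison per stored individual means the cost grows
        linearly (O(n)) with the number of enrolled faces, and each
        comparison was expensive on Spot's older software stack.
        """
        for name, face_image in known_faces.items():
            # 'model.similarity' is a stand-in for a one-shot comparison
            # (e.g. a Siamese-style pairwise forward pass).
            if model.similarity(query_face, face_image) >= threshold:
                return name
        return None  # no match found -> Spot barks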
To rectify this, I went back to the drawing board. All of the approaches I had tried thus far had been tailor-made to do nothing but face recognition. However, there was a much simpler and faster solution: treat the images of the faces as just that, images. I realized that although the Haar cascades from earlier were ill-suited to telling unique individuals apart, they were fantastic at simply detecting whether somebody was present. Using this, I would cut out an image of the person's face and pass it into an open-source library based on OpenAI’s CLIP (Contrastive Language-Image Pre-Training) model, which let me create vector representations of the extracted face images. These vectors were then stored in a vector database I set up. This allowed me to perform facial recognition with a similarity search within the database, which offered fast lookup times and inherently gave us the ability to remember multiple different individuals at once. I ended up using PostgreSQL with the pgvector extension, as it allowed me to access the database using traditional SQL queries.
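The detection-and-embedding half of that pipeline looks roughly like the sketch below. I am using OpenCV's stock frontal-face Haar cascade and the sentence-transformers CLIP checkpoint as stand-ins here; the actual CLIP-based library and model we ran on Spot may have differed.

    import cv2
    from PIL import Image
    from sentence_transformers import SentenceTransformer

    # Stand-in CLIP wrapper; the project used an open-source CLIP-based
    # library, but this particular package/checkpoint is an assumption.
    clip_model = SentenceTransformer('clip-ViT-B-32')

    # OpenCV's bundled frontal-face Haar cascade: good at answering
    # "is there a face here", which is all we need before handing off to CLIP.
    face_cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + 'haarcascade_frontalface_default.xml')

    def embed_faces(image_bgr):
        """Detect faces, crop them out, and return one CLIP vector per face."""
        gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
        boxes = face_cascade.detectMultiScale(gray, scaleFactor=1.1,
                                              minNeighbors=5)
        embeddings = []
        for (x, y, w, h) in boxes:
            crop = cv2.cvtColor(image_bgr[y:y + h, x:x + w], cv2.COLOR_BGR2RGB)
            embeddings.append(clip_model.encode(Image.fromarray(crop)))
        return embeddings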
This vector-database-based approach ended up working super well, and it could quite reliably distinguish individuals from one another using just a photograph of their face. After that, the only thing left to do was write wrapper functions that let Spot take images, add new faces to the database, and compare a freshly captured face against the existing entries to determine whether the person it was looking at was known.
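On the database side, the enrollment and lookup wrappers boiled down to a couple of SQL statements. The sketch below is an approximation: the table name, column sizes, distance threshold, and connection string are all placeholders, and it assumes a table created with something like CREATE TABLE faces (id serial PRIMARY KEY, name text, embedding vector(512)) on a database with the pgvector extension enabled.

    import psycopg2
    from pgvector.psycopg2 import register_vector

    conn = psycopg2.connect('dbname=spotyguard')  # placeholder connection
    register_vector(conn)  # teach psycopg2 how to send/receive vector values

    def add_face(name, embedding):
        """Enroll a known individual by storing their face embedding."""
        with conn, conn.cursor() as cur:
            cur.execute('INSERT INTO faces (name, embedding) VALUES (%s, %s)',
                        (name, embedding))

    def identify_face(embedding, threshold=0.25):
        """Nearest-neighbour lookup; <=> is pgvector's cosine-distance operator."""
        with conn, conn.cursor() as cur:
            cur.execute(
                'SELECT name, embedding <=> %s AS distance '
                'FROM faces ORDER BY distance LIMIT 1',
                (embedding,))
            row = cur.fetchone()
        if row is None or row[1] > threshold:
            return None  # unknown person -> Spot barks
        return row[0]

Keeping it as plain SQL queries (rather than a dedicated vector-search service) is what made PostgreSQL with pgvector attractive: the lookup is a single ORDER BY over the distance operator, and the rest is ordinary database plumbing.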
What I learned.
In completing this project, I became comfortable not just with working on the Spot robot, but also with libraries such as Psycopg2. As much as I would love to link the GitHub repository for this project, since the code that makes Spot work is pretty cool to see, the project and the code associated with it are property of the university. If you are interested in how the project as a whole works, below is a poster that we made to present at the research symposium.