Build an object detection model with Amazon Rekognition Custom Labels and Roboflow
Mark McQuade
Computer vision technology is making a difference in every industry — from ensuring hard hat compliance at construction sites, to identifying plants vs. weeds for targeted herbicide use, to identifying and counting cell populations in laboratory experiments.
By training computers to interpret the visual world as well as, or better than, humans can, we can quickly identify and classify objects and automatically take actions based on that information. This makes it possible to improve workplace safety, protect our environment and accelerate innovation across industries.
Computer vision problem types
Although computer vision output is relatively simple (“This person is or is not wearing a hard hat correctly at the construction site”), training the computer vision backend can be challenging. Depending on the problem type, it must be able to accurately identify and locate objects at different levels of detail, such as:
- Classification: “This is a person.”
- Classification + Localization: “This is a person, and this is where the person is in the image.”
- Object detection: “There are two people, plus one hard hat, at the construction site.”
- Semantic segmentation: “There are two people, plus one hard hat, and this is the shape of each.”
- Keypoint detection and pose estimation: “There are two people. One is wearing a hard hat, but it is not positioned correctly. The other is not wearing a hard hat at all.”
To get the right output, you need the right input. And that generally requires seven important steps. Let’s walk through those.
The seven steps of training an object detection model from scratch
1. Defining the problem
Start by defining exactly what you want to do. What is your use case? This will help guide each of the steps that follow.
2. Data collection
Next, you’ll need to collect photos and videos that are representative of the problem you’re trying to solve for. For example, if you’re aiming to build a hard hat detector, you’ll need to collect images of multiple hard hat types, as well as settings where people may be wearing hard hats. Remember to provide images in a variety of conditions: bright vs. dim, indoor vs. outdoor, sunny vs. rainy, people alone vs. in a group, etc. The better the variety, the better your model can learn.
3. Labeling
There are dozens of image annotation formats, and image labels come in all shapes and sizes. Popular formats include Pascal VOC, COCO JSON and YOLO TXT, but each model framework expects a certain type of annotation. For example, TensorFlow expects TFRecord files, and the Amazon Rekognition service expects a manifest.json file in an AWS-specific annotation format.
So, above all, make sure that your images are labeled consistently, in the format that your model framework requires. And use a tool like Amazon SageMaker Ground Truth to streamline the process.
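To make the format differences concrete, here is a minimal Python sketch that converts one bounding box from Pascal VOC corner coordinates to the COCO and YOLO conventions. The function names and the sample box are hypothetical; only the coordinate conventions come from the formats themselves.

```python
def voc_to_coco(xmin, ymin, xmax, ymax):
    """Pascal VOC corners -> COCO [x, y, width, height], in pixels."""
    return [xmin, ymin, xmax - xmin, ymax - ymin]


def voc_to_yolo(xmin, ymin, xmax, ymax, img_w, img_h):
    """Pascal VOC corners -> YOLO (x_center, y_center, width, height), normalized to 0..1."""
    x_center = (xmin + xmax) / 2 / img_w
    y_center = (ymin + ymax) / 2 / img_h
    return (x_center, y_center, (xmax - xmin) / img_w, (ymax - ymin) / img_h)


# Hypothetical hard hat box in a 640x480 image.
box = (160, 120, 480, 360)
coco_box = voc_to_coco(*box)            # [160, 120, 320, 240]
yolo_box = voc_to_yolo(*box, 640, 480)  # (0.5, 0.5, 0.5, 0.5)
```

The same rectangle ends up as three different sets of numbers, which is why mixing formats within one dataset tends to produce silently wrong labels.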
Some labeling tips to keep in mind:
- Label around the entirety of the object. It’s better to include a little bit of non-object buffer than to exclude a portion of the object within a rectangular label. Your model will understand edges far better this way.
- Label hidden/occluded objects entirely. If an object is out of view because another object is in front of it, label the object anyway, as though you could see it in its entirety. Your model will begin to understand the true bounds of objects this way.
- For objects partially out of frame, generally label them. This depends on the problem you’re trying to solve for. But in general, even a partial object is still an object to be labeled.
4. Data pre-processing
Now is the time to ensure your data is formatted correctly for your model — resizing, re-orienting, making color corrections, etc., as needed. For example, if your model requires a square aspect ratio, you should format your photos/videos to fill a square space — perhaps using black or white pixels to fill the empty space.
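As a rough illustration of the square-padding idea, the sketch below computes how an image would be scaled and padded to fit a square, and moves a bounding box along with it so the label still lines up. The function names and sizes are assumptions made for this example, not part of any particular tool.

```python
def letterbox_params(img_w, img_h, target):
    """Return (scale, pad_x, pad_y) that fit an image inside a target x target square."""
    scale = target / max(img_w, img_h)
    new_w, new_h = round(img_w * scale), round(img_h * scale)
    pad_x = (target - new_w) // 2  # empty columns added on the left
    pad_y = (target - new_h) // 2  # empty rows added on the top
    return scale, pad_x, pad_y


def letterbox_box(box, scale, pad_x, pad_y):
    """Map an (xmin, ymin, xmax, ymax) box into the padded square."""
    xmin, ymin, xmax, ymax = box
    return (xmin * scale + pad_x, ymin * scale + pad_y,
            xmax * scale + pad_x, ymax * scale + pad_y)


# Hypothetical 800x400 photo resized into a 400x400 square.
scale, pad_x, pad_y = letterbox_params(800, 400, 400)
new_box = letterbox_box((200, 100, 600, 300), scale, pad_x, pad_y)
```

Here the image is halved in size and 100 rows of fill pixels are added above and below it, so the box shifts down by the same amount.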
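You will also want to strip EXIF metadata from your images, since it can sometimes confuse the model. For example, an EXIF orientation flag can make an image display differently from the way its pixels are stored, so the labels no longer line up with the objects. Or if you want the model to be insensitive to color (e.g., it doesn’t matter what color the hard hat is), you can convert your images/video to grayscale, to eliminate that factor.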
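A minimal sketch of the grayscale step, assuming an image is represented as rows of (R, G, B) tuples. The weights are one common luminance formula (ITU-R BT.601); other weightings exist, and the function name is hypothetical.

```python
def to_grayscale(pixels):
    """Convert rows of (R, G, B) tuples to rows of single luminance values (0-255)."""
    return [
        [round(0.299 * r + 0.587 * g + 0.114 * b) for (r, g, b) in row]
        for row in pixels
    ]


# A tiny 1x3 "image": one yellow, one white and one black pixel.
image = [[(255, 255, 0), (255, 255, 255), (0, 0, 0)]]
gray = to_grayscale(image)
```

After conversion, a yellow hard hat and a white one differ only in brightness, so the model has less reason to key on color.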
5. Data augmentation
Next, you should apply transformations to your existing images, to expose your model to a wider array of training examples. By flipping, rotating, distorting, blurring, and adjusting the color of your images, you are, in effect, creating new data.
So, instead of having to actually take photos of people wearing hard hats in different lighting conditions, you can use augmentation to simulate brighter or dimmer room lighting. You can also train your model to be insensitive to occlusion, so that it can still detect an object even if it becomes blocked by another object. You can do this by adding black box “cutouts” to your photos/videos, to help train your model.
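The sketch below shows three of these augmentations on a tiny grayscale image stored as nested lists: a horizontal flip (with the matching bounding box update), a brightness change, and a random cutout. All names here are hypothetical, and a real pipeline would normally use an image library rather than plain lists.

```python
import random


def flip_horizontal(pixels):
    """Mirror each row left to right."""
    return [list(reversed(row)) for row in pixels]


def flip_box_horizontal(box, img_w):
    """Mirror an (xmin, ymin, xmax, ymax) box so it still covers the same object."""
    xmin, ymin, xmax, ymax = box
    return (img_w - xmax, ymin, img_w - xmin, ymax)


def adjust_brightness(pixels, factor):
    """Scale grayscale values by factor, clamped to 0..255."""
    return [[min(255, max(0, round(v * factor))) for v in row] for row in pixels]


def cutout(pixels, size, rng):
    """Return a copy with one size x size black square at a random position."""
    h, w = len(pixels), len(pixels[0])
    top = rng.randrange(h - size + 1)
    left = rng.randrange(w - size + 1)
    out = [row[:] for row in pixels]  # copy so the original image is untouched
    for y in range(top, top + size):
        for x in range(left, left + size):
            out[y][x] = 0
    return out


small = [[1, 2, 3, 4],
         [5, 6, 7, 8]]
flipped = flip_horizontal(small)                          # [[4, 3, 2, 1], [8, 7, 6, 5]]
flipped_box = flip_box_horizontal((0, 0, 1, 2), img_w=4)  # (3, 0, 4, 2)

image = [[100] * 6 for _ in range(6)]
brighter = adjust_brightness(image, 1.5)                  # simulates brighter lighting
occluded = cutout(image, size=2, rng=random.Random(0))    # simulates an occluding object
```

Note that geometric augmentations such as flips must update the labels too, while brightness changes and cutouts leave the boxes where they are.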
6. Training the model
To train your model, you’ll be using a tool like Amazon Rekognition Custom Labels, which will process the inputs you’ve created during the first five steps. You’ll need to decide, though, what’s most important for your use case: accuracy, speed, or model size? Generally, these factors trade off with one another.
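When weighing accuracy against speed and size, it helps to put a number on accuracy. One common way for detection models is precision, recall and F1 on a held-out test set; the sketch below computes them from counts of correct detections, false alarms and misses. The counts are made up for illustration.

```python
def precision_recall_f1(true_pos, false_pos, false_neg):
    """Compute precision, recall and F1 from detection counts."""
    found = true_pos + false_pos    # everything the model reported
    actual = true_pos + false_neg   # everything that was really there
    precision = true_pos / found if found else 0.0
    recall = true_pos / actual if actual else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical test-set counts: 80 hard hats found, 20 false alarms, 20 missed.
precision, recall, f1 = precision_recall_f1(80, 20, 20)
```

A safety use case might favor recall (miss as few bare heads as possible), while an alerting system that people must respond to might favor precision.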
7. Inference
Now it’s time to actually put your model into production. This will vary depending on the type of deployment. For example, will you be using embedded devices — such as cameras on a factory line? Or will this be a server-side deployment with APIs?
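For a server-side deployment, a rough sketch of calling a trained Amazon Rekognition Custom Labels model from Python might look like the following. It assumes the boto3 SDK, configured AWS credentials and a running model version; the helper names, the ARN argument and the sample response are placeholders written for this example. The service returns bounding boxes as ratios of the image size, so the first helper converts them to pixels.

```python
def boxes_to_pixels(response, img_w, img_h, min_confidence=50.0):
    """Convert ratio-based boxes in a DetectCustomLabels-style response to pixel corners."""
    results = []
    for label in response.get("CustomLabels", []):
        geometry = label.get("Geometry")
        # Skip image-level labels (no bounding box) and low-confidence results.
        if geometry is None or label["Confidence"] < min_confidence:
            continue
        box = geometry["BoundingBox"]
        left = round(box["Left"] * img_w)
        top = round(box["Top"] * img_h)
        results.append({
            "name": label["Name"],
            "confidence": label["Confidence"],
            "xmin": left,
            "ymin": top,
            "xmax": left + round(box["Width"] * img_w),
            "ymax": top + round(box["Height"] * img_h),
        })
    return results


def detect(image_bytes, model_arn):
    """Call the service. Needs boto3, AWS credentials and a running model; not run here."""
    import boto3  # imported lazily so the helper above works without the SDK
    client = boto3.client("rekognition")
    return client.detect_custom_labels(
        ProjectVersionArn=model_arn,
        Image={"Bytes": image_bytes},
        MinConfidence=50,
    )


# Hand-written sample shaped like a response, for illustration only.
sample = {"CustomLabels": [
    {"Name": "hard-hat", "Confidence": 91.5,
     "Geometry": {"BoundingBox": {"Left": 0.25, "Top": 0.5, "Width": 0.5, "Height": 0.25}}},
    {"Name": "construction-site", "Confidence": 88.0},
]}
detections = boxes_to_pixels(sample, img_w=640, img_h=480)
```

An embedded deployment would replace the `detect` call with whatever runtime the device supports, but the post-processing step stays much the same.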
See the process in action
Last year we recorded a webinar, where we walked through how to use Amazon Rekognition Custom Labels with Roboflow to deploy a system that can detect whether or not people are wearing face masks. You can apply the same steps to your own object detection models, to serve your own use cases.
Watch the webinar on-demand to follow the end-to-end process of creating a working object detection model.