Why Use Video Annotation?
Video annotation is more complex than image annotation and naturally costs more in time, processing, and latency. Video also contains temporal redundancy (the very thing compression exploits), so each individual frame contributes relatively little new information to the neural network. Before committing to video annotation, you want to make sure the expected return is clear compared to labeling individual frames as images.
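As a rough illustration of that redundancy, you can measure it directly: the minimal sketch below (assuming OpenCV is installed and using a hypothetical local file, video.mp4) computes the mean absolute pixel difference between consecutive frames, which stays small for most of a typical clip.

```python
# Minimal sketch: quantify how little changes between consecutive frames.
# Assumes OpenCV (pip install opencv-python) and a hypothetical local file "video.mp4".
import cv2
import numpy as np

cap = cv2.VideoCapture("video.mp4")
ok, prev = cap.read()
if not ok:
    raise RuntimeError("could not read the video")

diffs = []
while True:
    ok, frame = cap.read()
    if not ok:
        break
    # Mean absolute difference per pixel between this frame and the previous one.
    diffs.append(np.mean(cv2.absdiff(frame, prev)))
    prev = frame
cap.release()

print(f"frames compared: {len(diffs)}")
print(f"mean inter-frame pixel difference: {np.mean(diffs):.2f} (0-255 scale)")
```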
There are many reasons to go for video annotation. Let’s go over the main ones.
Context Spreads Across Multiple Frames
It is often much easier for the person doing the data labeling to analyze a given frame in the context of the frames before and after it. The motion of an object, which is easy to see in a video, can be very hard to identify from a single frame.
For instance, can you detect the car in this frame?
With the full context of the video, the car becomes much clearer, even in frames that are vague on their own:
Information Spreads Across Multiple Frames
Often, the information we are looking for is not contained within a single frame. The most basic example is human action recognition, where you need the context of the entire scene to determine the action.
Is this person taking the product, or putting it back?
The information we are asking about is not contained within a single image; the action is defined over time.
Labeling Efficiency
Another reason to use video is labeling efficiency: you get many frames at a much lower cost per frame compared to single-image annotation. When working on videos, we preserve context between frames, so the annotation process focuses only on what changes from frame to frame. Adding capabilities like object tracking, motion detection, or model assistance yields much more labeled data in a more cost-effective way. However, you should always ask whether the overall return is justified, since each frame adds very little information on top of the previous one.
Labeling QA
In many cases, labeling QA is much more effective on video, since the data QA engineer can simply watch the entire scene and point to a problematic location, reviewing hundreds of frames per minute instead of dozens.
Audio Information
In some cases, audio greatly helps in understanding the scene. For example, if you are trying to classify a person's emotional state, voice pitch and tone are very meaningful.
Common Challenges of Video Annotation
Data Volume
Videos can be quite complex, spanning a wide variety of visual styles and an enormous amount of data. On top of that, the objects in a video are continuously moving, and each one must be labeled with cuboids, lines, bounding boxes, and so on. A typical video runs at 30-60 frames per second, so a 2-minute clip at 30 fps already contains around 3,600 frames. Despite the much larger frame count, annotating video can take less time overall, since individual images each require loading time. In scenes that change a lot, you still need to understand the context frame by frame; in scenes that change little, annotation goes far quicker because most of the changes between frames can be tracked automatically. This works well for bounding boxes, classifications, and key points, but more complex annotations, e.g. polylines, can be very hard to track at all.
As mentioned, objects in a video are continuously moving, as opposed to an image file, which is just a still frame. A single image often doesn't let you decipher what a person is actually doing, whereas a video gives you far more information. If you're annotating the whole video, you can see what happens next and correct yourself. Additionally, annotating video can be significantly faster, in some cases up to 50% faster than annotating the equivalent images, depending on the video itself.
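To put the volume in numbers, the frame count grows linearly with duration and frame rate. Here is a minimal sketch, assuming OpenCV and a hypothetical local file clip.mp4, that reads both straight from the video metadata:

```python
# Minimal sketch: estimate annotation volume from video metadata.
# Assumes OpenCV and a hypothetical local file "clip.mp4".
import cv2

cap = cv2.VideoCapture("clip.mp4")
fps = cap.get(cv2.CAP_PROP_FPS)                      # e.g. 30.0
frame_count = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
cap.release()

duration_s = frame_count / fps if fps else 0
print(f"{fps:.0f} fps, {frame_count} frames, ~{duration_s / 60:.1f} minutes")
# At 30 fps, a 2-minute clip is already ~3,600 frames to review.
```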
Frame and Pixel Accuracy in Browsers
Browsers are not built for pixel- and frame-accurate video; in many cases they do not even render the video at a consistent frame rate. (In fact, browsers have no concept of frames at all, only time.) Models, on the other hand, are very sensitive to pixel errors and operate on frames rather than time.
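Because the browser only exposes a time position while the model consumes frame indices, every video annotation tool ends up converting between the two, and that conversion is exactly where off-by-one frame errors creep in. Here is a minimal sketch of the mapping, assuming a constant frame rate (real containers can use variable frame rates, which makes it worse):

```python
# Minimal sketch: mapping between a browser's time position and a frame index,
# assuming a constant frame rate.
FPS = 29.97  # a common broadcast rate; floating-point time makes exact mapping tricky

def time_to_frame(t_seconds: float, fps: float = FPS) -> int:
    # Rounding (rather than truncating) avoids landing one frame early
    # on floating-point time values like 0.99999.
    return round(t_seconds * fps)

def frame_to_time(frame_index: int, fps: float = FPS) -> float:
    return frame_index / fps

# A 1-frame error at 30 fps is ~33 ms - invisible when watching,
# but it shifts every label by one frame when exported for training.
print(time_to_frame(10.0), frame_to_time(300))
```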
Some of the errors you might hit are clearly visible – for instance, the video does not play, gets stuck, or has no sound. The silent errors are the bigger challenge: everything looks fine at a glance, but a more detailed inspection reveals frozen frames, frame skipping, and time offsets, all of which result in poor-quality annotations that the annotator never sees.
Oftentimes, bugs that matter to machine learning specialists get overlooked in browsers.
Take a look at this bug (actually an upstream FFmpeg issue) – it's 11 years old and isn't expected to be resolved by the Chromium or FFmpeg teams any time soon, since only machine learning folks care about it; it doesn't present a significant problem for standard media viewing on the web.
Dataloop’s video editor handles these cases for you in the browser (how we accomplished this is probably worth an article by itself – so stay tuned 😉).
Dataloop’s Video Features
Designed for more efficient and productive video labeling.
Scene Classification
Classification allows users to quickly identify the content inside each data item, categorize it into groups and clusters, and then translate the data distribution into immediate insights. With Dataloop’s platform, you can significantly expedite any classification task by letting users select and tag bulk items simultaneously, automatically switch to the next item upon completion, and identify similarities between pairs or sets of items in one click. In scene classification, the video tells the full story in a way still images simply can’t: it is very difficult to understand what is happening from a single frame, while video uncovers many more details.
In addition, Dataloop’s ML models can automatically segment datasets into clusters in advance, saving you time and allowing users to spend time on validating instead of manually tagging data.
Object Tracking
Our video annotation platform supports these annotation tools: classification, point (with pose), box, note, polygon, polyline, cuboid, and auto annotation tools.
With our linear interpolation and smart tracking models, you can automatically duplicate annotations between video frames and sequenced images, making multiple object tracking extra easy.
Dataloop’s video interpolation enables users to automatically change the position and size of an annotation, based on the differences between the fixed keyframes.
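Conceptually, this kind of interpolation just blends the coordinates of the two surrounding keyframes. The sketch below is a simplified illustration of that idea, not Dataloop's actual implementation:

```python
# Simplified sketch of keyframe interpolation for a bounding box
# (an illustration only, not Dataloop's actual implementation).
def interpolate_box(box_a, box_b, frame_a, frame_b, frame):
    """Linearly interpolate (left, top, right, bottom) between two keyframes."""
    t = (frame - frame_a) / (frame_b - frame_a)
    return tuple(a + t * (b - a) for a, b in zip(box_a, box_b))

# Keyframes annotated manually at frames 0 and 30; frames in between are derived.
key_start = (100, 50, 180, 130)   # frame 0
key_end   = (160, 80, 240, 160)   # frame 30
for f in (0, 10, 20, 30):
    print(f, interpolate_box(key_start, key_end, 0, 30, f))
```

The annotator only corrects the keyframes where the linear assumption breaks down; everything in between comes for free.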
You can also track an object throughout a video automatically using our tracker plugin, which predicts the object's position in upcoming frames so you don't have to redraw the annotation frame by frame.
Select the object you wish to track, give it an annotation label, turn on the video tracker, and let the magic happen as it predicts future frames and auto-annotates them.
Hidden Objects
Avoid creating a new annotation every time an object re-appears on screen, which leads to duplication and discontinuity. Instead, apply our occlusion setting to hide the annotation only in the frames where the object is occluded.
With our new updates, when you run into a scenario in which you're annotating an object that disappears and re-appears, you don't need to annotate it twice. Even if you only have two objects in the video and they're constantly appearing and disappearing, you only need to annotate each of them once.
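One way to picture this is a single annotation that spans the whole video and carries occluded frame ranges, rather than a separate annotation for each appearance. The sketch below is a conceptual illustration only, not Dataloop's actual annotation format:

```python
# Conceptual sketch: one annotation per object, with occluded frame ranges,
# instead of a new annotation each time the object re-appears.
# This is an illustration, not Dataloop's actual data format.
from dataclasses import dataclass, field

@dataclass
class TrackedAnnotation:
    label: str
    first_frame: int
    last_frame: int
    occluded_ranges: list = field(default_factory=list)  # list of (start, end) frames

    def is_visible(self, frame: int) -> bool:
        if not (self.first_frame <= frame <= self.last_frame):
            return False
        return not any(start <= frame <= end for start, end in self.occluded_ranges)

person = TrackedAnnotation("person", first_frame=0, last_frame=900,
                           occluded_ranges=[(120, 180), (400, 450)])
print(person.is_visible(150))  # False - hidden here, but still the same annotation
print(person.is_visible(200))  # True
```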
Introducing Dataloop’s Video 2.0 Studio
Video Studio 2.0 introduces frame-level label annotation that is far easier to manage, giving you better visibility into individual frames and greater frame accuracy. Frame-level annotation lets you work at whatever resolution and accuracy you need; everything is visualized and quicker than before. If you change a label partway through the video, you will see that change visually in the timeline. If you’d like to learn more about these features or speak to an expert, you can schedule a 1:1 personalized demo.