A Free Open Source Full Motion Video Workflow

Darren Wiens March 24, 2022


For a lot of geospatial practitioners, remote sensing data means one thing: orthorectified raster images. This makes perfect sense: top-down static imagery surely accounts for the majority of remote sensing data in use today. Whether the imagery sensor is mounted on an Unmanned Aerial Vehicle (UAV), airborne system (fixed-wing or helicopter), or satellite, the product delivered to remote sensing specialists is generally one or more discrete, orthorectified images. However, there is another, somewhat more elusive type of remotely sensed data that this post will attempt to expose: full motion video (FMV).

Before I get too far, I should note that there are a few accepted definitions for the term “full motion video”. To some, FMV means a video in which the camera sensor may be in motion. To others, any video in which there is discernible motion between video frames may be considered FMV. To others still, FMV refers to a technique in video games where action occurs through video clips. To clarify, this post is concerned with full motion video in the first two senses: collected in such a way that both the sensor location and video frame viewing area may be geographically positioned at all times, and possibly showing motion between video frames. This type of full motion video will hereafter be referred to simply as “video.”

Overhead video is not a new concept. We’ve likely all seen weather and police helicopter footage, and many commercially available UAVs collect video by default. Some commercial satellites also collect video (e.g. Planet SkySat, Earth-i Vivid-i, and Satellogic Full Motion Video). While access to overhead video is possible for those motivated, the ability to use the data in a geographic application varies between datasets and providers/manufacturers.

In order to be suitable for geographic analysis and display, rapidly collected intra-video metadata is required, embedded within or alongside the video data. For as many time points as possible, at the very least, the following metadata should be provided or derivable: time, as well as 3D geographic coordinates for the frame corners and sensor location. While there is no true consensus regarding geographic video metadata overall, the most popular standards for airborne video are those recommended by the Motion Imagery Standards Board (MISB). One such recommended standard is KLV (key-length-value), which specifies compliant metadata packets (metadata embedded within video files at individual time points). If video data does not contain a KLV data stream, bespoke metadata must be defined and delivered by the data provider or sensor manufacturer. For example, detailed DJI UAV video metadata may be produced post-flight, upon request.
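To make the key-length-value idea concrete, here is a minimal sketch of how a buffer of KLV triplets can be walked in plain Python. It assumes a 16-byte key and a BER-encoded length (short form for lengths under 128, long form otherwise); the function name `read_klv` and the toy packet are made up for illustration, and the sketch does not decode the fields inside the value the way a full MISB-aware parser would.

```python
from typing import Iterator, Tuple


def read_klv(buf: bytes, key_length: int = 16) -> Iterator[Tuple[bytes, bytes]]:
    """Yield (key, value) pairs from a buffer of back-to-back KLV triplets."""
    pos = 0
    while pos < len(buf):
        # Key: a fixed-size identifier (16 bytes is an assumption here).
        key = buf[pos:pos + key_length]
        pos += key_length
        # Length: BER-encoded.
        first = buf[pos]
        pos += 1
        if first < 0x80:
            # Short form: the byte itself is the value length.
            length = first
        else:
            # Long form: the low 7 bits say how many length bytes follow.
            n = first & 0x7F
            length = int.from_bytes(buf[pos:pos + n], "big")
            pos += n
        # Value: the next `length` bytes.
        yield key, buf[pos:pos + length]
        pos += length


# Toy packet: a made-up 16-byte key, a one-byte length (3), and a 3-byte value.
toy_key = bytes(range(16))
packet = toy_key + bytes([3]) + b"abc"
for key, value in read_klv(packet):
    print(key.hex(), value)
```

In practice the value of each packet is itself a set of nested tag-length-value items, which is the part that a dedicated decoder handles.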

The current ecosystem of free, open source geographic video tooling is, as far as I have found, limited. The main tools I used, or referred to, while making this post are:

QGIS FMV plug-in: A great resource for understanding full motion video code
klvdata: Python module for decoding KLV metadata
FFmpeg/FFmpeg-python: CLI and associated Python bindings for working with video files

There are many proprietary processing and viewing options for geographic video, which are out of scope for this blog post.

This post will outline a scalable, standardized, free and open-source workflow for preparing and displaying overhead video content on a map.

The Plan

The sample video I used for this example can be found here: https://www.arcgis.com/home/item.html?id=55ec6f32d5e342fcbfba376ca2cc409a

This video features a truck driving on a highway, while the airborne camera circles overhead.

Excerpt from sample truck FMV (source: https://www.arcgis.com/home/item.html?id=55ec6f32d5e342fcbfba376ca2cc409a).

I only have one main use-case in mind for this video: display the playing video on a map, correctly geolocated at all times.

Metadata Structure

Before starting this project, I knew only a few things:

Mapbox GL JS and MapLibre will play videos within a video source/layer (examples: Mapbox, MapLibre), bounded by four geographic corner coordinates. It is possible to update those coordinates through time, in order to synchronize the video with geographic location.
I wanted this workflow to be scalable, so I would need a standardized configuration or metadata file that will provide enough information to drive the map, given any video.

I am familiar with the Spatiotemporal Asset Catalog (STAC) specification, but there was no defined schema for including video-specific metadata. I have always wanted to contribute something to the specification, and this seemed like a nice opportunity to create a STAC extension, so that’s what I did, with lots of encouragement and help from the community. The extension is available here.

The new video extension is a standardized set of properties and recommendations for describing items containing video assets. Specifically, the extension introduces item-level properties like pixel dimensions, frame count/rate, and file encoding. There are also recommended companion vector assets that we will use in this workflow, most importantly, those assigned asset roles containing “video:frame_geometries” and “video:sensor_centers”.

Ideally, the geometry files will be sufficient to enable an animation like this, correctly locating the sensor and video frame through time:

Desired geometries to derive from video file or associated metadata.

Data Preparation

The sample video (MPEG-2 transport stream, *.ts) contains KLV-encoded metadata in the data channel, in addition to the standard video and audio channels. While it may be possible to use the video as is and extract the KLV stream at runtime, doing so is, at the very least, complicated, and would require some amount of custom JavaScript that is beyond my grasp. By design, however, using in situ KLV metadata on the fly would still be compliant with the video extension.

My simpler (admittedly naive) solution is to extract KLV metadata beforehand, and store frame geometries and sensor centers in sidecar GeoJSON files.

Using a few freely available tools, extracting and decoding frame corner coordinates can be done like so:

Copy the data stream to a temporary binary file, having set variables for the input .ts file ($fname) and output .bin file ($outbin):

$ ffmpeg -i $fname -map d -codec copy -f data $outbin

In Python, after installing/importing the klvdata module, we can parse the binary file like the following. Each instance of metadata is a dictionary containing all KLV fields, including corner coordinates, sensor coordinates, and timestamp values, which can be written to our asset geojson files:

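import klvdata

# outbin is the path to the .bin file written by the ffmpeg command above
with open(outbin, 'rb') as f:
    for packet in klvdata.StreamParser(f):
        # one dictionary of decoded KLV fields per metadata packet
        metadata = packet.MetadataList()
        # extract desired metadata and write to geojson here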
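The "write to geojson" step could look something like the sketch below. It does not read from klvdata at all: the corner coordinates and timestamp are hypothetical values standing in for whatever is pulled out of one decoded metadata packet, and the `datetime` property name and the helper `frame_feature` are my own choices for illustration rather than anything required by the video extension.

```python
import json


def frame_feature(corners, timestamp):
    """Build a GeoJSON Polygon feature from four (lon, lat) frame corners."""
    ring = [list(c) for c in corners]
    # GeoJSON linear rings must repeat the first position at the end.
    ring.append(ring[0])
    return {
        "type": "Feature",
        "geometry": {"type": "Polygon", "coordinates": [ring]},
        "properties": {"datetime": timestamp},
    }


# Hypothetical corner values for one packet, ordered top-left, top-right,
# bottom-right, bottom-left (the corner order is an assumption here).
corners = [(-105.10, 39.91), (-105.09, 39.91), (-105.09, 39.90), (-105.10, 39.90)]

collection = {
    "type": "FeatureCollection",
    "features": [frame_feature(corners, "2022-03-24T00:00:00Z")],
}

# Serialize; in a real run this string would be written to the sidecar file.
geojson_text = json.dumps(collection)
```

Sensor centers can be handled the same way with Point geometries, one feature per metadata packet.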

We can also convert the transport stream video file to mp4 format, where we have set variables for $fname and $outmp4:

$ ffmpeg -i $fname -c:v libx264 -crf 0 -c:a copy $outmp4 -y

The last step is to create a STAC item which implements the video extension. You can find an example here. Note the video properties, as well as the way assets are organized (assets for the video and GeoJSON files, indicated and grouped by asset roles).
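As a rough sketch of the overall shape of such an item, the snippet below builds a bare STAC item as a Python dictionary. Only the core STAC fields and the two asset roles named earlier in this post are filled in; the asset keys, the `data` role on the video asset, and the helper name `make_video_item` are assumptions for illustration. The extension schema URL and the item-level video properties are deliberately left out and should be taken from the extension itself.

```python
def make_video_item(item_id, bbox, datetime_str, video_href, frames_href, sensors_href):
    """Assemble a minimal STAC item dictionary for one video and its sidecar files."""
    west, south, east, north = bbox
    # Item footprint as a closed ring built from the bounding box.
    footprint = [[west, south], [east, south], [east, north], [west, north], [west, south]]
    return {
        "type": "Feature",
        "stac_version": "1.0.0",
        "stac_extensions": [],  # the video extension schema URL would be listed here
        "id": item_id,
        "bbox": list(bbox),
        "geometry": {"type": "Polygon", "coordinates": [footprint]},
        "properties": {"datetime": datetime_str},  # video properties would be added here
        "links": [],
        "assets": {
            # Asset keys are arbitrary; the roles are what a client searches on.
            "video": {"href": video_href, "type": "video/mp4", "roles": ["data"]},
            "frames": {
                "href": frames_href,
                "type": "application/geo+json",
                "roles": ["video:frame_geometries"],
            },
            "sensors": {
                "href": sensors_href,
                "type": "application/geo+json",
                "roles": ["video:sensor_centers"],
            },
        },
    }


# Hypothetical identifiers, extent, and file names.
item = make_video_item(
    "truck-video",
    (-105.2, 39.8, -105.0, 40.0),
    "2022-03-24T00:00:00Z",
    "truck.mp4",
    "truck_frames.geojson",
    "truck_sensors.geojson",
)
```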

Optional: there is no guarantee that every video frame has a corresponding KLV-encoded metadata packet. In the case of our sample video, there are 4441 frames, but only 711 metadata packets, meaning that there are roughly six times as many video frames as known frame locations (4441 / 711 ≈ 6.2). The consequence of this is that the video will appear to move in a visibly halting fashion, stalling at each location for about six frames. The solution for this problem is to interpolate frame locations between known locations, which you can do by inserting intermediate frame corner coordinates between known corner coordinates:

Interpolate frame corner coordinates to smoothly transition between known corner coordinates extracted from metadata packets.
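One way to do this interpolation is sketched below, assuming each known corner set comes with a timestamp in seconds from the start of the video. The function `corners_at` is a hypothetical helper, and it interpolates linearly in longitude/latitude, which is a simplification that seems reasonable over the short distances between consecutive packets but does not handle cases such as crossing the antimeridian.

```python
from bisect import bisect_right


def corners_at(t, times, corner_sets):
    """Linearly interpolate four (lon, lat) frame corners at time t.

    times: strictly increasing packet times, in seconds.
    corner_sets: one list of four (lon, lat) pairs per entry in times.
    """
    # Outside the known range, hold the first or last known footprint.
    if t <= times[0]:
        return [tuple(c) for c in corner_sets[0]]
    if t >= times[-1]:
        return [tuple(c) for c in corner_sets[-1]]
    # Find the known packets on either side of t.
    i = bisect_right(times, t) - 1
    w = (t - times[i]) / (times[i + 1] - times[i])
    # Blend each corner between its two known positions.
    return [
        (ax + (bx - ax) * w, ay + (by - ay) * w)
        for (ax, ay), (bx, by) in zip(corner_sets[i], corner_sets[i + 1])
    ]


# Two made-up footprints one second apart; the frame shifts 4 degrees east.
times = [0.0, 1.0]
corner_sets = [
    [(0.0, 0.0), (2.0, 0.0), (2.0, 2.0), (0.0, 2.0)],
    [(4.0, 0.0), (6.0, 0.0), (6.0, 2.0), (4.0, 2.0)],
]
halfway = corners_at(0.5, times, corner_sets)
```

Calling this once per video frame time gives one footprint per frame, regardless of how unevenly the metadata packets are spaced.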

The Final Map

Finally, we want to load and play the video on a map. The most basic architecture would look like this, where the UI references a single video file directly, and cycles through frame geometries during playback:

Simple architecture, referencing assets directly.
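The "cycling through frame geometries" part boils down to mapping the current playback time to a frame number. The actual map UI is JavaScript, so the following is only a sketch of the lookup logic in Python, assuming one geometry per video frame and a constant frame rate; the 30 frames per second used in the example is a made-up value, not a property of the sample video.

```python
def frame_index(current_time, frame_rate, frame_count):
    """Return the index of the frame geometry to show at a playback time (seconds)."""
    idx = int(current_time * frame_rate)
    # Clamp so that seeking before the start or past the end stays in range.
    return max(0, min(idx, frame_count - 1))


# Hypothetical frame rate; 4441 is the frame count of the sample video.
index_at_one_second = frame_index(1.0, 30.0, 4441)
index_past_the_end = frame_index(1000.0, 30.0, 4441)
```

On each animation tick, the UI would read the video element's current time, look up the matching geometry, and update the video layer's corner coordinates with it.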

A slightly more sophisticated architecture looks like the one below, where the UI makes a request to a STAC API (spec, example implementation) for a single STAC item stored in the database. We can then use the hrefs within the STAC item to request short-lived URLs (e.g. AWS presigned URLs or Azure SAS tokens) corresponding to the video and GeoJSON assets stored in S3. Finally, we can use the generated URLs to request and use the asset data in our map. Temporary URLs don’t solve all of our access problems, but at least the links expire after a given amount of time, and we gain some access control at the API level.

A more sophisticated architecture, allowing STAC queries against a database through a STAC API, and generating short-lived URLs on demand through a Presigned URL API.
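Once the STAC item has come back from the API, the client needs to pick out the right hrefs before asking for short-lived URLs. A minimal sketch of that lookup, keyed on asset roles, is shown below. The item here is a hand-written stand-in with hypothetical bucket and file names, and `hrefs_by_role` is an illustrative helper, not part of any STAC library; the presigning request itself is not shown.

```python
def hrefs_by_role(item, role):
    """Return the hrefs of all assets in a STAC item that carry the given role."""
    return [
        asset["href"]
        for asset in item.get("assets", {}).values()
        if role in asset.get("roles", [])
    ]


# Stand-in for an item returned by a STAC API (hypothetical hrefs).
example_item = {
    "assets": {
        "video": {"href": "s3://example-bucket/truck.mp4", "roles": ["data"]},
        "frames": {
            "href": "s3://example-bucket/truck_frames.geojson",
            "roles": ["video:frame_geometries"],
        },
        "sensors": {
            "href": "s3://example-bucket/truck_sensors.geojson",
            "roles": ["video:sensor_centers"],
        },
    }
}

frame_hrefs = hrefs_by_role(example_item, "video:frame_geometries")
sensor_hrefs = hrefs_by_role(example_item, "video:sensor_centers")
```

Looking assets up by role rather than by asset key means the same client code should work for any item that follows the extension's recommendations, whatever the assets happen to be called.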

You can see the code for a map UI that assumes an architecture like that above, in this gist. The final map with moving video looks like this:

FMV produced with the workflow presented in this blog post, including interpolated frame coordinates, and free floating camera.

Using the Mapbox GL JS FreeCamera API, and the same code in the gist with `sync_camera=true`, we can also watch the video from the point of view of the original sensor:

FMV produced with the workflow presented in this blog post, including interpolated frame coordinates, and synchronized camera.

Future work would involve creating a more fully featured UI, implementing proper authentication, and building out the STAC catalog to include more videos and geometric metadata.

If you make use of either the UI or STAC extension presented in this post, or have any questions or feedback regarding any of its content, please get in touch!

Postscript: While writing this blog post, I became aware of “OGC Testbed-16: Full Motion Video to Moving Features Engineering Report” and the associated Moving Features OGC standard and W3C WebVMT draft spec, which cover a lot of the same ground (quite literally: the report uses the same truck video as a demonstration). I take the overlap between the report/standard/spec and this workflow as some amount of validation. If nothing else, readers of this blog post should be aware of these OGC/W3C initiatives, and I will attempt to align future development of the STAC Video extension with them where possible.