This tool extracts frames, motion vectors, frame types and timestamps from H.264 and MPEG-4 Part 2 encoded videos.
This class is a replacement for OpenCV's VideoCapture and can be used to read and decode video frames from an H.264 or MPEG-4 Part 2 encoded video stream/file. It returns the following values for each frame:
- decoded frame as BGR image
- motion vectors
- frame type (keyframe, P- or B-frame)
- (for RTSP streams) UTC wall time of the moment the sender sent out the frame (as opposed to the easily retrievable time of frame reception)
These additional features enable further projects, such as fast visual object tracking or synchronization of multiple RTSP streams. Both a C++ and a Python API are provided. Under the hood, FFmpeg is used.
The image below shows a video frame with extracted motion vectors overlaid.
A usage example can be found here.
- Included community contributions (many thanks to @luowyan and @xyperias)
- Added support for Python 3.11 and 3.12 and dropped support for Python 3.8
- Upgraded Docker image from deprecated manylinux_2_24_x86_64 to manylinux_2_28_x86_64
- Improved CI pipeline to run unit tests on every push to a feature branch
- Improved the test suite
- Upgraded build dependencies (OpenCV 4.5.5 -> 4.10.0, numpy 1.x -> 2.0.0)
- Support numpy 2.x as runtime dependency (see this issue)
You can install the motion vector extractor via pip
pip install --upgrade pip
pip install motion-vector-extractor
Note that we currently provide the package only for x86-64 Linux (e.g. Ubuntu or Debian) and Python 3.9, 3.10, 3.11, 3.12, and 3.13. If you are on a different platform, please use the Docker image as described below.
Download the example video vid_h264.mp4 from the repo and place it somewhere. To extract the motion vectors, open a terminal at the same location and run
extract_mvs vid_h264.mp4 --preview --verbose
The extraction script provides command line options to store extracted motion vectors to disk, and to enable/disable graphical output. For all options type
extract_mvs -h
For example, to store extracted frames and motion vectors to disk without showing graphical output run
extract_mvs vid_h264.mp4 --dump
The --dump parameter also takes an optional destination directory.
You can run the test suite either directly on your machine or (easier) within the provided Docker container. Both methods require you to first clone the repository. To this end, change into the desired installation directory on your machine and run
git clone https://github.com/LukasBommes/mv-extractor.git mv_extractor
To run the tests in the Docker container, change into the mv_extractor directory and run
./run.sh /bin/bash -c 'yum install -y compat-openssl10 && python3.12 -m unittest discover -s tests -p "*tests.py"'
To run the tests directly on your machine, you need to install the motion vector extractor as explained above.
Now, change into the mv_extractor directory and run the tests with
python3 -m unittest discover -s tests -p "*tests.py"
Confirm that all tests pass.
Some tests run the LIVE555 Media Server, which has dependencies of its own, such as OpenSSL. Make sure these dependencies are installed correctly on your machine; otherwise you will get test failures with messages such as "error while loading shared libraries: libssl.so.10: cannot open shared object file: No such file or directory". On AlmaLinux, for example, you can fix this issue by installing OpenSSL with
yum install -y compat-openssl10
On other operating systems you may be lacking additional dependencies, and the package names and installation commands may differ.
If you want to use the motion vector extractor in your own Python script import it via
from mvextractor.videocap import VideoCap
You can then use it according to the example in extract_mvs.py.
Generally, a video file is opened by VideoCap.open() and frames, motion vectors, frame types and timestamps are read by calling VideoCap.read() repeatedly. Before exiting the program, the video file has to be closed by VideoCap.release(). For a more detailed explanation see the API documentation below.
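Putting the three calls together, a minimal capture loop could look like the following sketch. The summarize_frame helper is ours (not part of the library), and the loop itself only executes if the example video vid_h264.mp4 is present in the working directory:

```python
import os

def summarize_frame(frame_type, motion_vectors, timestamp):
    """One human-readable line per decoded frame."""
    return "%s-frame, %d motion vectors, t=%.3f" % (
        frame_type, motion_vectors.shape[0], timestamp)

# The capture loop only runs when the example video is present.
if os.path.exists("vid_h264.mp4"):
    from mvextractor.videocap import VideoCap

    cap = VideoCap()
    if not cap.open("vid_h264.mp4"):
        raise RuntimeError("could not open vid_h264.mp4")
    while True:
        success, frame, motion_vectors, frame_type, timestamp = cap.read()
        if not success:  # end of stream or decoding error
            break
        print(summarize_frame(frame_type, motion_vectors, timestamp))
    cap.release()
```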
Instead of installing the motion vector extractor via PyPI, you can also use the prebuilt Docker image from Docker Hub. The Docker image contains the motion vector extractor and all its dependencies and comes in handy for quick testing or in case your platform is not compatible with the provided Python package.
To use the Docker image you need to install Docker. Furthermore, you need to clone the source code with
git clone https://github.com/LukasBommes/mv-extractor.git mv_extractor
Afterwards, you can run the extraction script in the mv_extractor directory as follows
./run.sh python3.12 extract_mvs.py vid_h264.mp4 --preview --verbose
This pulls the prebuilt Docker image from Docker Hub and runs the extraction script inside the Docker container.
This step is not required; for a faster installation, we recommend using the prebuilt image instead.
If you still want to build the Docker image locally, you can do so by running the following command in the mv_extractor directory
docker build . --tag=mv-extractor
Note that building can take more than one hour.
Now, run the docker container with
docker run -it --ipc=host --env="DISPLAY" -v $(pwd):/home/video_cap -v /tmp/.X11-unix:/tmp/.X11-unix:rw mv-extractor /bin/bash
This module provides a Python API which is very similar to that of OpenCV VideoCapture. Using the Python API is the recommended way of using the H.264 Motion Vector Capture class.
Methods | Description |
---|---|
VideoCap() | Constructor |
open() | Open a video file or url |
grab() | Reads the next video frame and motion vectors from the stream |
retrieve() | Decodes and returns the grabbed frame and motion vectors |
read() | Convenience function which combines a call of grab() and retrieve(). |
release() | Close a video file or url and release all resources |
Constructor. Takes no input arguments.
Open a video file or url. The stream must be H.264 or MPEG-4 Part 2 encoded; otherwise, undesired behaviour is likely.
Parameter | Type | Description |
---|---|---|
url | string | Relative or fully specified file path or an url specifying the location of the video stream. Example "vid.flv" for a video file located in the same directory as the source files. Or "rtsp://xxx.xxx.xxx.xxx:554" for an IP camera streaming via RTSP. |
Returns | Type | Description |
---|---|---|
success | bool | True if video file or url could be opened successfully, false otherwise. |
Reads the next video frame and motion vectors from the stream, but does not yet decode them. Thus, grab() is fast. A subsequent call to retrieve() is needed to decode and return the frame and motion vectors. The purpose of splitting up grab() and retrieve() is to provide a means to capture frames in multi-camera scenarios that are as close in time as possible. To do so, first call grab() on all cameras and only afterwards call retrieve() on all cameras.
Takes no input arguments.
Returns | Type | Description |
---|---|---|
success | bool | True if next frame and motion vectors could be grabbed successfully, false otherwise. |
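The multi-camera pattern described above (grab() on all streams first, then retrieve()) can be sketched as follows; the helper function and camera URLs are hypothetical:

```python
def grab_then_retrieve(caps):
    """First call grab() on every capture so the buffered frames are as
    close in time as possible, then decode them with retrieve().
    Returns the retrieve() tuples of all captures that grabbed successfully."""
    grabbed = [cap for cap in caps if cap.grab()]
    return [cap.retrieve() for cap in grabbed]

# Hypothetical usage with two IP cameras (URLs are placeholders):
#   caps = [VideoCap(), VideoCap()]
#   caps[0].open("rtsp://192.168.0.10:554")
#   caps[1].open("rtsp://192.168.0.11:554")
#   for success, frame, motion_vectors, frame_type, timestamp in grab_then_retrieve(caps):
#       ...
```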
Decodes and returns the grabbed frame and motion vectors. Prior to calling retrieve() on a stream, grab() needs to have been called and returned successfully.
Takes no input arguments and returns a tuple with the elements described in the table below.
Index | Name | Type | Description |
---|---|---|---|
0 | success | bool | True in case the frame and motion vectors could be retrieved successfully, false otherwise or in case the end of the stream is reached. When false, the other tuple elements are set to empty numpy arrays or 0. |
1 | frame | numpy array | Array of dtype uint8 shape (h, w, 3) containing the decoded video frame. w and h are the width and height of this frame in pixels. Channels are in BGR order. If no frame could be decoded an empty numpy ndarray of shape (0, 0, 3) and dtype uint8 is returned. |
2 | motion vectors | numpy array | Array of dtype int32 and shape (N, 10) containing the N motion vectors of the frame. Each row of the array corresponds to one motion vector. If no motion vectors are present in a frame, e.g. if the frame is an I frame, an empty numpy array of shape (0, 10) and dtype int32 is returned. The columns of each vector have the following meaning (also refer to AVMotionVector in the FFmpeg documentation): - 0: source: offset of the reference frame from the current frame. The reference frame is the frame the motion vector points to and where the corresponding macroblock comes from. If source < 0, the reference frame is in the past. For source > 0 it is in the future (in display order). - 1: w: width of the vector's macroblock. - 2: h: height of the vector's macroblock. - 3: src_x: x-location (in pixels) where the motion vector points to in the reference frame. - 4: src_y: y-location (in pixels) where the motion vector points to in the reference frame. - 5: dst_x: x-location of the vector's origin in the current frame (in pixels). Corresponds to the x-center coordinate of the corresponding macroblock. - 6: dst_y: y-location of the vector's origin in the current frame (in pixels). Corresponds to the y-center coordinate of the corresponding macroblock. - 7: motion_x: macroblock displacement in x-direction, multiplied by motion_scale to become an integer. Used to compute the fractional value of src_x as src_x = dst_x + motion_x / motion_scale. - 8: motion_y: macroblock displacement in y-direction, multiplied by motion_scale to become an integer. Used to compute the fractional value of src_y as src_y = dst_y + motion_y / motion_scale. - 9: motion_scale: see definition of columns 7 and 8. Used to scale the motion components up to integer values. E.g., if motion_scale = 4, the motion components are stored as integers but encode a displacement with 1/4 pixel precision. Note: src_x and src_y are only in integer resolution. They are contained in the AVMotionVector struct and exported only for the sake of completeness. Use the equations in columns 7 and 8 to get more accurate fractional values for src_x and src_y. |
3 | frame_type | string | Unicode string representing the type of frame. Can be "I" for a keyframe, "P" for a frame with references to only past frames and "B" for a frame with references to both past and future frames. A "?" string indicates an unknown frame type. |
4 | timestamp | double | UTC wall time of each frame in the format of a UNIX timestamp. If the input is a video file, the timestamp is derived from the system time. If the input is an RTSP stream, the timestamp marks the time the frame was sent out by the sender (e.g. an IP camera). Thus, the timestamp represents the wall time at which the frame was taken rather than the time at which the frame was received. This allows e.g. for accurate synchronization of multiple RTSP streams. In order for this to work, the RTSP sender needs to generate RTCP sender reports which contain a mapping from wall time to stream time. Not all RTSP senders send sender reports, as this is not mandated by the standard. If IP cameras are used which implement the ONVIF standard, sender reports are always sent and thus timestamps can always be computed. |
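As an illustration of columns 5-9 of the motion vector array described above, the fractional (sub-pixel) source positions can be recovered with plain numpy; the helper name is ours:

```python
import numpy as np

def fractional_sources(motion_vectors):
    """Recover sub-pixel source positions from the (N, 10) motion vector
    array returned by retrieve(), as src = dst + motion / motion_scale,
    using columns dst_x (5), dst_y (6), motion_x (7), motion_y (8),
    and motion_scale (9)."""
    mvs = motion_vectors.astype(np.float64)
    src_x = mvs[:, 5] + mvs[:, 7] / mvs[:, 9]
    src_y = mvs[:, 6] + mvs[:, 8] / mvs[:, 9]
    return src_x, src_y
```

Unlike columns 3 and 4, which hold only integer positions, this recovers the encoder's quarter-pel (or similar) precision.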
Convenience function which internally calls first grab() and then retrieve(). It takes no arguments and returns the same values as retrieve().
Close a video file or url and release all resources. Takes no input arguments and returns nothing.
The C++ API differs from the Python API in what parameters the methods expect and what values they return. Refer to the docstrings in src/video_cap.hpp.
What follows is a short explanation of the data returned by the VideoCap class. Also refer to this excellent book by Iain E. Richardson for more details.
The decoded video frame. Nothing special about that.
H.264 and MPEG-4 Part 2 use different techniques to reduce the size of a raw video frame prior to sending it over a network or storing it in a file. One of those techniques is motion estimation and prediction of future frames based on previous or future frames. Each frame is segmented into macroblocks of e.g. 16 x 16 pixels. During encoding, motion estimation matches every macroblock to a similar-looking macroblock in a previously encoded frame (note that this frame can also be a future frame, since encoding and presentation order might differ). This allows transmitting only the motion vector and a reference to an already encoded macroblock instead of the macroblock's full pixel data, effectively reducing the amount of transmitted or stored data.
Motion vectors correlate directly with motion in the video scene and are useful for various computer vision tasks, such as visual object tracking.
In MPEG-4 Part 2, macroblocks are always 16 x 16 pixels. In H.264, macroblocks can be 16x16, 16x8, 8x16, 8x8, 8x4, 4x8, or 4x4 pixels in size.
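To make block-based motion estimation concrete, here is a toy exhaustive-search sketch over a small search window. Real encoders use far more sophisticated search strategies and sub-pixel refinement; this helper is purely illustrative:

```python
import numpy as np

def motion_search(ref, cur, y, x, bh=16, bw=16, search=4):
    """Toy exhaustive block matching: find the displacement (dx, dy)
    within +/-search pixels that minimizes the sum of absolute
    differences (SAD) between the macroblock at (y, x) in the current
    frame and a candidate block in the reference frame."""
    block = cur[y:y + bh, x:x + bw].astype(np.int64)
    best, best_sad = (0, 0), None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            yy, xx = y + dy, x + dx
            if yy < 0 or xx < 0 or yy + bh > ref.shape[0] or xx + bw > ref.shape[1]:
                continue  # candidate block would leave the frame
            sad = np.abs(ref[yy:yy + bh, xx:xx + bw].astype(np.int64) - block).sum()
            if best_sad is None or sad < best_sad:
                best, best_sad = (dx, dy), sad
    return best
```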
The frame type is either "P", "B" or "I" and refers to the H.264 encoding mode of the current frame. An "I" frame is sent fully over the network and serves as a reference for "P" and "B" frames, for which only differences to previously decoded frames are transmitted. Those differences are encoded via motion vectors. As a consequence, for an "I" frame no motion vectors are returned by this library. The difference between "P" and "B" frames is that "P" frames refer only to past frames, whereas "B" frames have motion vectors which refer to both past and future frames. References to future frames are possible even with live streams because the decoding order of frames differs from the presentation order.
In addition to extracting motion vectors and frame types, the video capture class also outputs a UNIX timestamp representing UTC wall time for each frame. If the stream originates from a video file, this timestamp is simply derived from the current system time. However, when an RTSP stream is used as input, the timestamp calculation is more intricate, as the timestamp represents not the time when the frame was received, but the time when the frame was sent by the sender. Thus, this timestamp can be used for accurate synchronization of multiple video streams.
Computation of the frame wall time works as follows:
- Wait for an RTCP sender report packet, which contains a mapping between the stream's RTP timestamp and the current UTC wall time. Now, a correlation between the stream's RTP timestamps and wall time is known. Name the RTP timestamp T_RTP_LAST and the corresponding UTC wall time T_UTC_LAST.
- For each new frame, compute the UTC timestamp as follows:
T_UTC = T_UTC_LAST + (T_RTP - T_RTP_LAST) / 90000
Here, T_RTP is the frame's RTP timestamp, and T_RTP_LAST and T_UTC_LAST are the RTP timestamp and corresponding UTC wall time of the last RTCP sender report packet. The factor of 90000 is needed because RTP timestamps for video increment at a 90 kHz clock rate, i.e., the RTP timestamp increments by 90000 every second.
Note that the sender's clock needs to be synchronized with a network time server (via NTP) to ensure that frame timestamps are in sync with UTC. Most IP cameras provide an option for this.
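The computation above amounts to a one-line linear mapping. As a minimal sketch (the function name is ours, not part of the library):

```python
RTP_CLOCK_RATE = 90000  # RTP video clock ticks per second

def rtp_to_utc(t_rtp, t_rtp_last, t_utc_last, clock_rate=RTP_CLOCK_RATE):
    """Map a frame's RTP timestamp to UTC wall time, given the RTP/UTC
    pair (t_rtp_last, t_utc_last) from the last RTCP sender report."""
    return t_utc_last + (t_rtp - t_rtp_last) / clock_rate

# A frame arriving 90000 RTP ticks (= 1 second) after the last sender
# report maps to one second after the report's wall time:
print(rtp_to_utc(t_rtp=180000, t_rtp_last=90000, t_utc_last=1700000000.0))
# -> 1700000001.0
```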
This software is written by Lukas Bommes. It is based on MV-Tractus and OpenCV's videoio module.
This project is licensed under the MIT License - see the LICENSE file for details.
If you use our work for academic research please cite
@INPROCEEDINGS{9248145,
author={L. {Bommes} and X. {Lin} and J. {Zhou}},
booktitle={2020 15th IEEE Conference on Industrial Electronics and Applications (ICIEA)},
title={MVmed: Fast Multi-Object Tracking in the Compressed Domain},
year={2020},
volume={},
number={},
pages={1419-1424},
doi={10.1109/ICIEA48937.2020.9248145}}