Temporal Difference Networks Explained

tsukuru
15 min read · Jul 9, 2024


Table of Contents:

Concepts to know before starting to read the paper
· Temporal
· Temporal Information
· Short-Term and Long-Term Temporal Information
· Temporal Difference
RGB Difference
Feature Difference
Example
· Temporal Difference Operator
· Sampling Strategy
Holistic Sampling
Sparse Sampling
Combining Strategies
Paper Walkthrough
· 1. Abstract
· 2. Introduction
· 3. Temporal Difference Network — TDN
3.1 Overview
3.2 Short-term Temporal Difference Module (S-TDM)
3.3 Long-term Temporal Difference Module (L-TDM)
3.4 Overall
· 4. Experimentation
· 5. Conclusion

Temporal Difference Networks

Some terminology to understand before reading the paper

Since the Temporal Difference Network builds on TSN, it is worth going through the Temporal Segment Networks (TSN) paper before reading about TDN.

Temporal

The term temporal generally relates to time. In the context of video analysis, it describes anything related to the sequence and progression of images (frames) over time.

Temporal Information

Temporal information means the sequence of events or changes that occur over time within a video. It involves capturing how objects and actions evolve from one frame to the next. In a video showing a person waving their hand, temporal information would include the entire sequence of frames showing how the hand moves from one position to another over time.

Short-Term and Long-Term Temporal Information

  1. Short-Term Temporal Information:
  • Refers to the changes that happen over a short duration, typically between consecutive or closely spaced frames.
  • Captures immediate movements and actions, such as the swing of a bat or a blink of an eye.
  • Essential for understanding fast actions and subtle changes.
  • In the same video of waving, short-term information would focus on the immediate changes from one frame to the next, capturing the precise motion of the hand.

2. Long-Term Temporal Information:

  • Refers to the changes and patterns that occur over a longer duration, spanning many frames.
  • Captures the overall progression and structure of actions, such as running from one side of a field to the other or a person performing a dance routine.
  • Important for understanding activities that have a broader context and involve a sequence of different movements.
  • In the video of waving, this would involve looking at how the hand wave progresses over the entire duration of the action, capturing the start, middle, and end of the wave.

Temporal Difference

Temporal difference refers to the difference in information between temporally adjacent (consecutive) frames. It is a way to capture motion by calculating the change between consecutive frames; this difference highlights areas of the image where movement has occurred, making it easier to detect and understand motion.

By subtracting the pixel values of one frame from the next, we can see which parts of the image have changed, effectively highlighting the moving hand while ignoring the static background.

Source: Temporal Segment Networks Paper

Temporal difference operations involve calculating the change between consecutive frames in a video to extract motion information. This concept can be applied in various ways, depending on whether the differences are computed at the pixel level (RGB Difference) or at a higher feature level (Feature Difference).

RGB Difference

RGB Difference is a method used for motion extraction in videos by computing the difference between the RGB values of consecutive frames. It helps in highlighting changes or movements within the video.

This difference highlights the pixels that have changed between the frames, indicating motion.

  • Simonyan and Zisserman (2014): Introduced two-stream convolutional networks, whose motion stream takes stacked optical flow as input; RGB difference is often used as a cheap approximation of this kind of motion input.
  • Wang et al. (2016): Explored RGB difference as an additional input modality in Temporal Segment Networks to improve action recognition accuracy.
  • Tran et al. (2015): Learned spatiotemporal features from stacked RGB frames with 3D convolutions (C3D), an implicit way of capturing temporal changes.

Feature Difference

Feature Difference involves computing the difference between features extracted from consecutive frames. Unlike RGB difference, which works directly on pixel values, feature difference works on higher-level features obtained through feature extraction techniques (e.g. the output of a CNN).

It provides a higher-level representation of motion by focusing on features rather than raw pixel values.

Used in various video analysis tasks where high-level feature information is more relevant than raw pixel changes.

  • Wang et al. (2018): Used feature difference in temporal relational networks to model temporal dependencies.
  • Zhou et al. (2018): Employed feature difference in dynamic image networks for action recognition.
  • Sun et al. (2015): Used feature difference in a two-stream fusion approach for video classification.
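To make this concrete, here is a minimal sketch of a feature difference computation. The ResNet-18 backbone and the frame count are just illustrative assumptions; any frame-level CNN would work the same way.

```python
import torch
import torchvision.models as models

# Minimal sketch: feature difference between consecutive frames.
# The ResNet-18 backbone is an arbitrary choice for illustration.
backbone = models.resnet18(weights=None)
backbone.fc = torch.nn.Identity()      # keep the 512-d pooled features
backbone.eval()

frames = torch.rand(8, 3, 224, 224)    # 8 consecutive RGB frames (toy data)

with torch.no_grad():
    feats = backbone(frames)           # [8, 512] frame-level features

feature_diff = feats[1:] - feats[:-1]  # [7, 512], one difference per frame pair
print(feature_diff.shape)              # torch.Size([7, 512])
```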

Example

  • Frame 1: The player is about to kick the ball.
  • Frame 2: The player’s leg has moved, and the ball has started to move.

When you subtract Frame 1 from Frame 2, you get an image showing the leg and the ball in motion. This image, created by temporal difference, highlights just the movement, making it easier to understand the action.

In summary, temporal difference is a technique to detect movement in videos by comparing consecutive frames and highlighting the changes. This way, it focuses on the action, ignoring the static parts.

Temporal Difference Operator

The temporal difference operator computes the difference between pixel values of consecutive frames.

By applying the temporal difference operator across all pairs of consecutive frames in a video, you can create a series of difference maps that capture the motion throughout the video.

These difference maps can be used as input to machine learning models, particularly convolutional neural networks (CNNs), to help the model learn and recognize motions.
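As a minimal sketch, the operator can be written as a single subtraction over a stacked video array; the shapes below are assumptions for illustration.

```python
import numpy as np

def temporal_difference_maps(video):
    """Pixel-level difference maps between consecutive frames.

    video: array of shape [T, H, W, 3] holding T RGB frames.
    Returns an array of shape [T-1, H, W, 3], one map per consecutive pair.
    """
    video = video.astype(np.float32)
    return video[1:] - video[:-1]

# Toy example: 16 frames of a 64x64 RGB video.
video = np.random.randint(0, 256, size=(16, 64, 64, 3))
diff_maps = temporal_difference_maps(video)
print(diff_maps.shape)  # (15, 64, 64, 3) -> can be fed to a CNN as motion input
```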

Sampling Strategy

Holistic and sparse sampling strategies are methods used to select frames or segments from a video to effectively capture and represent the temporal information for tasks such as action recognition. These strategies balance the need to reduce computational complexity while still maintaining the important temporal dynamics of the video.

Holistic Sampling

Holistic sampling involves selecting a broad, comprehensive set of frames from the entire video. The goal is to capture the overall context and full sequence of actions happening in the video.

  • Frames are sampled uniformly across the entire duration of the video.
  • This ensures that all parts of the video contribute to the analysis, providing a complete temporal overview. Imagine a video of a person cooking a meal. Holistic sampling would select frames evenly throughout the entire cooking process.

Sparse Sampling

Sparse sampling involves selecting a limited number of key frames or segments from the video. The goal is to reduce the computational load while still capturing the essential movements and actions.

  • Frames are sampled at sparse, strategically chosen intervals.
  • Often, key frames are selected based on certain criteria, such as high motion activity or specific time intervals. For the same cooking video, sparse sampling might select frames only at key moments, such as chopping vegetables, stirring the pot, and plating the food. This reduces the number of frames to analyze but focuses on the critical actions.

Combining Strategies

One thing you’ll notice in this paper is that the authors apply both sampling strategies. First they divide the video into k segments, then they take 5 frames from each segment, for a total of 5 × k frames per video (dense sampling). In each segment, the central frame is passed on its own through the CNN, which captures its spatial information; the temporal difference of the remaining frames is calculated to capture the motion information. Finally, the output of the CNN is fused with the motion information. In effect, the 5 frames per segment are reduced to 1 frame-level representation containing both appearance and motion information (sparse), which is passed on for further processing. A minimal sketch of this sampling is shown below.
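Here is that combined strategy as a rough sketch. The window of 5 frames around each segment's centre follows the description above; the clipping behaviour at segment borders is an assumption.

```python
import numpy as np

def tdn_sample_indices(num_frames, num_segments=8, frames_per_segment=5):
    """Sketch of TDN-style sampling: k segments, 5 frames per segment.

    The centre frame of each segment carries the appearance information;
    the other frames are only used to compute RGB differences.
    """
    seg_len = num_frames // num_segments
    half = frames_per_segment // 2
    windows = []
    for seg in range(num_segments):
        center = seg * seg_len + seg_len // 2
        window = np.clip(np.arange(center - half, center + half + 1),
                         seg * seg_len, (seg + 1) * seg_len - 1)
        windows.append(window)
    return np.stack(windows)   # shape: [num_segments, frames_per_segment]

indices = tdn_sample_indices(num_frames=300)
print(indices.shape)           # (8, 5) -> 40 frames sampled in total
```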

Didn’t get it? Don’t worry. Keep reading; everything is explained below.

Paper Walkthrough

1. Abstract

Temporal modeling for action recognition in videos is a complex task. To address this, the authors introduce the Temporal Difference Network (TDN), a new architecture designed to capture multi-scale temporal information efficiently. They integrate TDN with existing CNNs (ResNet here) at minimal additional computational cost.

Temporal Difference Module (TDM): The authors use a temporal difference operator to create this module, which helps in understanding both short-term and long-term motion in videos.

Two-level Difference Modeling:

  • Local Motion Modeling: Within one segment, they use temporal differences between adjacent frames to give 2D CNNs finer motion details.
  • Global Motion Modeling: Then the temporal difference across segments is incorporated to capture long-range motion structures, enhancing motion feature detection.

2. Introduction

Temporal modeling is crucial for capturing motion information in videos for action recognition, typically achieved through two main mechanisms in current deep learning approaches.

  • Method 1: Involves using a two-stream network, where one stream processes RGB frames to extract appearance information, and the other leverages optical flow as input to capture movement information. This method is effective for improving action recognition accuracy but requires high computational resources for optical flow calculation.
  • Method 2: Uses 3D convolutions or temporal convolutions to implicitly learn motion features from RGB frames. However, 3D convolutions often lack explicit treatment of the temporal dimension and can result in higher computational costs.

Temporal derivatives (differences) are highly relevant to optical flow and have shown effectiveness in action recognition by using RGB differences as an approximate motion representation.

Previous methods treated the changes between consecutive video frames (RGB differences) as separate data and used a different network to process these changes. Then, they combined the results with the main video processing network. In contrast, the authors propose a unified framework that captures both appearance and motion information together. They do this by developing a more structured and efficient temporal module that can be integrated into the overall network design from start to finish.

Simply saying, the authors suggest a new method that combines the processing of the main video frames and the changes between frames into a single, more efficient network. So, instead of using two separate processes and then merging the results, the authors’ method does everything in one go, saving time and resources.

Both short-term and long-term motion information are essential for recognizing actions in videos because they capture different but complementary aspects of an action.

  • Short-Term Motion (Local Motion): They use a lightweight module (S-TDM) that processes differences between consecutive frames. This module works alongside the main RGB frame processing to add fine motion details, e.g. you notice the quick arm movements in consecutive frames (stages 1 and 2 in the figure below).
  • Long-Term Motion (Global Motion): They use a more complex module (L-TDM) that captures changes over longer periods by looking at differences across several segments of the video. This helps in understanding broader motion patterns, e.g. you observe the overall action of picking up the ball, throwing it, and watching it fly (stages 3-5 in the figure below).
Temporal Difference Network

Two datasets are used: Kinetics and Something-Something (yes, that’s really the name of the dataset).

Main Contributions:

  1. They generalize the idea of RGB difference to devise an efficient temporal difference module (TDM) for motion modeling in videos, providing an alternative to 3D convolutions by systematically presenting effective and detailed module designs.
  2. Their TDN presents a video-level motion modeling framework with the proposed temporal difference module, focusing on capturing both short-term and long-term temporal structures for video recognition.

3. Temporal Difference Network — TDN

3.1 Overview

Temporal Difference Network (TDN) is a video-level framework designed for learning action models using the entire video information. The key contribution is incorporating the temporal difference operator into the network design to explicitly capture both short-term and long-term motion information.

Each video V is divided into T non-overlapping segments of equal duration. From each segment, m frames are sampled and passed to the S-TDM, so there are m * T frames initially. These go through stages 1-2 (S-TDM). After the S-TDM, we are left with one frame-level representation per segment that contains both spatial and local motion information, i.e. T representations per video.

I = [I_1, I_2, ..., I_T], where the stacked frame tensor has shape [T, C, H, W].

Figure 3.1

3.2 Short-term Temporal Difference Module (S-TDM)

Figure 3.2.1

Adjacent frames in a video are very similar to each other, so directly stacking multiple frames for processing is inefficient: consecutive frames carry little new information. On the other hand, using only a single frame fails to capture the motion between frames. Therefore, the S-TDM enhances a single RGB frame with temporal differences to efficiently represent both appearance (spatial) and motion information.

The short-term TDM aims to supply these frame-wise representations F of early layers with local motion information to improve their representation power:

  • F_i: the features obtained after passing frame I_i through the 2D CNN (the upper conv1 in the figure).
  • H(I_i): the motion information calculated from frame I_i by considering the differences between I_i and its adjacent frames.

The initial feature representation F_i is then updated by adding the motion information H(I_i) to it, which gives a more enhanced representation. The short-term TDM essentially adds motion details to the features the network has already learned from each frame. This helps the network understand not just what is in each frame, but also how things move between frames, making it better at recognizing actions.

Figure 3.2.2

Short-term TDM: F_i = F_i + H(I_i), step by step

Extract temporal differences: for each frame I_i, several RGB differences are extracted from a local temporal window around I_i and then stacked along the channel dimension to form D(I_i).

Short-term TDM operation: the stacked RGB differences D(I_i) are processed through the following steps:

  • Downsample: reduce the resolution by half using average pooling.
  • 2D CNN: extract motion features using a lightweight 2D CNN.
  • Upsample: increase the resolution back to match the original RGB features.
  • Putting these together, the operation can be written as H(I_i) = Upsample(CNN(Downsample(D(I_i)))).
  • Finally, fusion with the output of the main CNN: F_i = F_i + H(I_i). A minimal code sketch of these steps follows the figure below.
Figure 3.2.3
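Below is a minimal PyTorch sketch of those steps. The channel sizes, the stride inside the "lightweight CNN", and the resolution of the centre-frame features are all illustrative assumptions; the paper's actual module is designed more carefully.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ShortTermTDM(nn.Module):
    """Simplified sketch of S-TDM: downsample stacked RGB differences,
    run a lightweight CNN, upsample, then add the result to the
    centre-frame features (F_i = F_i + H(I_i))."""

    def __init__(self, num_diffs=4, feat_channels=64):
        super().__init__()
        self.cnn = nn.Sequential(      # "lightweight" motion CNN (assumed design)
            nn.Conv2d(num_diffs * 3, feat_channels, kernel_size=3,
                      stride=2, padding=1),
            nn.BatchNorm2d(feat_channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, center_feat, frames):
        # frames: [B, 5, 3, H, W] -> 4 RGB differences stacked on the channel dim.
        diffs = frames[:, 1:] - frames[:, :-1]        # [B, 4, 3, H, W]
        d = diffs.flatten(1, 2)                       # [B, 12, H, W]
        d = F.avg_pool2d(d, kernel_size=2)            # downsample by half
        h = self.cnn(d)                               # extract motion features
        h = F.interpolate(h, size=center_feat.shape[-2:],
                          mode="bilinear", align_corners=False)  # upsample back
        return center_feat + h                        # fuse with appearance features

# Toy usage: centre-frame features assumed to come from an early conv layer.
stdm = ShortTermTDM()
center_feat = torch.rand(2, 64, 112, 112)
frames = torch.rand(2, 5, 3, 224, 224)
print(stdm(center_feat, frames).shape)                # torch.Size([2, 64, 112, 112])
```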

Simple summary: remember that we took 5 frames per segment? The central frame is enhanced with motion information generated from the other 4 frames.

Within a particular segment, the central frame is passed to the CNN to extract its spatial information. Around the central frame's local window, we sample additional frames and calculate their RGB differences, which gives us the local motion information discussed earlier. The output of the CNN (spatial information) is then merged with the temporal difference (motion information).

3.3 Long-term Temporal Difference Module (L-TDM)

The long-term TDM improves the network's understanding of actions by looking at the bigger picture, across different parts of the video (cross-segment), rather than just a few frames.

Figure 3.3.1

The Long-term Temporal Difference Module (Long-term TDM) is designed to improve the frame-level features by considering how motion changes across different segments of the video.

  • F_i: the feature representation of frame I_i, i.e. of segment i of the video.
  • G(F_i, F_i+1): the operation performed by the long-term TDM. It enhances the feature representation by looking at the difference between the features of the current segment, F_i, and those of the next segment, F_i+1. A simplified sketch follows the figures below.
Figure 3.3.2
Figure 3.3.3
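Here is a much-simplified sketch of the core idea: turn the cross-segment feature difference into an attention map that re-weights the current segment's features. The real L-TDM is more elaborate (the ablations in Section 4 find a spatiotemporal-attention design works best); the channel count and single-conv attention branch here are assumptions.

```python
import torch
import torch.nn as nn

class LongTermTDM(nn.Module):
    """Simplified sketch of L-TDM: use the feature difference between
    neighbouring segments to build an attention map that enhances the
    current segment's features."""

    def __init__(self, channels=256):
        super().__init__()
        self.attn = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
            nn.Sigmoid(),               # attention weights in [0, 1]
        )

    def forward(self, feat_i, feat_next):
        diff = feat_next - feat_i       # cross-segment feature difference
        attn = self.attn(diff)          # where has something moved?
        return feat_i * (1 + attn)      # enhance while keeping an identity path

# Toy usage with two neighbouring segment features (shapes are assumptions).
ltdm = LongTermTDM(channels=256)
feat_i, feat_next = torch.rand(2, 256, 28, 28), torch.rand(2, 256, 28, 28)
print(ltdm(feat_i, feat_next).shape)    # torch.Size([2, 256, 28, 28])
```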

TDN presents a two-level motion modeling mechanism, focused on capturing temporal information in a local-to-global fashion: short-term TDMs (S-TDM) in the early stages for finer, low-level motion extraction, and long-term TDMs (L-TDM) in the later stages for coarser, high-level temporal structure modeling.

3.4 Overall

Figure 3.4.1

Complete flow of the official code:

Since the file was very large, I couldn't upload a clear image here. Check this link instead: it's a .htm file; download it and open it in any browser. A schematic sketch of the overall flow is given after the figure below.

Figure 3.4.2
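Here is that schematic: a self-contained sketch of how the pieces fit together. This is not the official code; the toy CNNs, shapes, the wrap-around for the last segment, and the final average pooling are all assumptions used purely to show the data flow (per-segment S-TDM early, cross-segment L-TDM-style enhancement later, then temporal aggregation).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

T, B = 8, 2                                   # segments and batch size (assumed)
frames = torch.rand(B, T, 5, 3, 224, 224)     # 5 sampled frames per segment

early_cnn = nn.Sequential(nn.Conv2d(3, 64, 7, stride=4, padding=3), nn.ReLU())
motion_cnn = nn.Sequential(nn.Conv2d(12, 64, 3, padding=1), nn.ReLU())

# Stages 1-2: S-TDM -> one fused feature map per segment.
segment_feats = []
for t in range(T):
    center = early_cnn(frames[:, t, 2])                            # centre-frame appearance
    diffs = (frames[:, t, 1:] - frames[:, t, :-1]).flatten(1, 2)   # [B, 12, H, W]
    motion = motion_cnn(F.avg_pool2d(diffs, 4))                    # match centre resolution
    segment_feats.append(center + motion)                          # F_i = F_i + H(I_i)
feats = torch.stack(segment_feats, dim=1)                          # [B, T, 64, 56, 56]

# Stages 3-5: L-TDM-style enhancement using the next segment's features.
attn_conv = nn.Sequential(nn.Conv2d(64, 64, 3, padding=1), nn.Sigmoid())
enhanced = []
for t in range(T):
    nxt = feats[:, (t + 1) % T]                                    # wrap around at the end
    enhanced.append(feats[:, t] * (1 + attn_conv(nxt - feats[:, t])))
video_feat = torch.stack(enhanced, dim=1).mean(dim=[1, 3, 4])      # pool over time and space

print(video_feat.shape)                                            # torch.Size([2, 64])
```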

4. Experimentation

The TDN (Temporal Difference Network) architecture is designed to capture both short-term and long-term changes in video frames, improving the accuracy of action recognition by effectively modeling temporal information.

  1. Datasets and Training:
  • Kinetics Dataset: This dataset contains around 300,000 trimmed videos covering 400 categories. The authors trained TDN on 240,000 of these videos and evaluated it on 20,000 validation videos.
  • Something-Something Datasets: These datasets focus on actions involving various objects to emphasize motion rather than object context. The first version has about 100,000 videos across 174 categories, while the second version includes 169,000 training videos and 25,000 validation videos. The authors reported performance on the validation sets of both versions.

2. Model Architecture:

  • Backbones: TDN uses ResNet50 and ResNet101 as backbone models, sampling T = 8 or T = 16 frames from each video.
  • Training Details: During training, each video frame was resized to have a shorter side between 256 and 320 pixels, and a crop of 224 × 224 pixels was randomly selected. TDN was pre-trained on the ImageNet dataset. The batch size was set to 128, and the initial learning rate was 0.02. Training lasted for 100 epochs on the Kinetics dataset and 60 epochs on the Something-Something dataset, with the learning rate reduced by a factor of 10 when performance on the validation set plateaued.
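For readers who think in code, the schedule above might look roughly like this in PyTorch. The optimizer choice, momentum, weight decay, and plateau patience are assumptions that the summary above does not specify.

```python
import torch

model = torch.nn.Linear(10, 400)             # placeholder standing in for TDN
optimizer = torch.optim.SGD(model.parameters(), lr=0.02,
                            momentum=0.9, weight_decay=1e-4)
# Reduce the learning rate by 10x when validation accuracy plateaus.
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="max", factor=0.1, patience=5)

for epoch in range(100):                     # 100 epochs on Kinetics, 60 on Sth-Sth
    # ... train for one epoch, then evaluate on the validation set ...
    val_acc = 0.0                            # placeholder validation accuracy
    scheduler.step(val_acc)
```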

3. Testing Schemes:

  • The authors used two testing schemes:
  • 1-clip and center-crop: Only a center crop of 224 × 224 pixels from a single clip was used for evaluation.
  • 10-clip and 3-crop: Three crops of 256 × 256 pixels and 10 clips were used for testing, leading to better accuracy with a denser prediction scheme.
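A minimal sketch of the denser scheme: run the model on every clip-crop combination and average the class scores. The placeholder model and input shapes are assumptions for illustration.

```python
import torch

num_clips, num_crops, num_classes = 10, 3, 400
model = torch.nn.Sequential(                 # placeholder standing in for TDN
    torch.nn.AdaptiveAvgPool2d(1), torch.nn.Flatten(),
    torch.nn.Linear(3, num_classes))

clip_crops = torch.rand(num_clips * num_crops, 3, 256, 256)   # 30 views of one video
logits = model(clip_crops)                                    # [30, 400]
scores = logits.softmax(dim=-1).mean(dim=0)                   # average over all views
prediction = scores.argmax().item()                           # final class label
```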

4. Ablation Studies:

Ablation studies mean analyzing individual components of a model to understand their contributions to overall performance, something like tinkering with each individual component of the system.

  • Effect of Difference Operation: Using a temporal difference operator significantly improved accuracy compared to simply stacking or averaging temporal frames.
  • Short-term TDM (S-TDM): Various implementations of S-TDM were tested, and the addition of the temporal difference operation gave the best results.
  • Long-term TDM (L-TDM): Different implementations of L-TDM were evaluated, with spatiotemporal attention providing the best outcomes.
  • Location of TDMs: The study showed that placing S-TDMs in the early stages and L-TDMs in the later stages of ResNet50 resulted in the highest recognition accuracy.
  • Comparison with Other Temporal Modules: TDN was compared with other temporal modules like Temporal Convolution (T-Conv), Temporal Shift Module (TSM), and TEINet. TDN achieved superior performance.

5. Comparison with State-of-the-Art Methods:

  • TDN was compared with various state-of-the-art methods on the Something-Something V1 & V2 datasets and the Kinetics-400 dataset. Results showed that TDN consistently outperformed other methods, including those using 2D CNNs and 3D CNNs.

Through these extensive evaluations, the authors demonstrated that TDN is highly effective at capturing temporal differences, leading to improved action recognition in videos.

5. Conclusion

In this paper, the authors introduced a new framework called TDN (Temporal Difference Network) for recognizing actions in videos. The key innovation of TDN is the use of Temporal Difference Modules (TDM) which effectively capture both short-term and long-term changes in videos. TDMs are designed to be efficient and versatile, and the authors provided two specific implementations of these modules.

The authors systematically evaluated the impact of TDMs on video analysis and demonstrated that TDN outperforms previous state-of-the-art methods on the Kinetics-400 and Something-Something datasets.

They also conducted detailed ablation studies to explore the importance of the temporal difference operation. The results showed that using temporal differences is better at capturing detailed temporal information than standard 3D convolutions with more frames. The authors hope that their findings provide valuable insights into temporal modeling in videos and suggest that TDMs could be a good alternative to traditional 3D convolutions.

If you spot any errors or have any suggestions for improvement, please let me know at sandesh.cdr@gmail.com so I can correct them. Your feedback is greatly appreciated! Thanks for reading. Keep learning.
