Temporal Preference Optimization for
Long-Form Video Understanding

1Stanford University   2University of Science and Technology of China
*Equal contribution. Work done at Stanford University.

Abstract

Despite significant advancements in video large multimodal models (video-LMMs), achieving effective temporal grounding in long-form videos remains a challenge for existing models. To address this limitation, we propose Temporal Preference Optimization (TPO), a novel post-training framework designed to enhance the temporal grounding capabilities of video-LMMs through preference learning. TPO adopts a self-training approach that enables models to differentiate between well-grounded and less accurate temporal responses by leveraging curated preference datasets at two granularities: localized temporal grounding, which focuses on specific video segments, and comprehensive temporal grounding, which captures extended temporal dependencies across entire video sequences. By optimizing on these preference datasets, TPO significantly enhances temporal understanding while reducing reliance on manually annotated data. Extensive experiments on three long-form video understanding benchmarks—LongVideoBench, MLVU, and Video-MME—demonstrate the effectiveness of TPO across two state-of-the-art video-LMMs. Notably, LLaVA-Video-TPO establishes itself as the leading 7B model on the Video-MME benchmark, underscoring the potential of TPO as a scalable and efficient solution for advancing temporal reasoning in long-form video understanding.


Temporal Preference Optimization

Our work proposes Temporal Preference Optimization (TPO), a comprehensive pipeline for self-training-based temporal preference optimization of cutting-edge video large multimodal models (video-LMMs). TPO enhances video comprehension in video-LMMs by modeling temporal preferences at two granular levels: localized and comprehensive TPO. In localized TPO (upper-left), we generate queries focused on short segments, with contrastive responses that retain or exclude the target segment. For comprehensive TPO (lower-left), queries are designed for broader understanding, using the intact video versus a sparsely downsampled video to produce contrasting responses. After post-filtering, the contrastive response pairs serve as the preference dataset to train a video-LMM, guiding the model to prioritize preferred responses for improved video understanding.
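The sketch below illustrates, under stated assumptions, how the two kinds of preference pairs could be assembled and how they would feed a standard DPO-style objective. It is a minimal illustration rather than the released implementation: generate stands in for a video-LMM inference call, the function names (build_localized_pair, build_comprehensive_pair, dpo_loss) are hypothetical, and the post-filtering step is omitted.

# Minimal sketch (not the authors' released code) of assembling TPO-style
# preference pairs and optimizing them with a DPO-style objective.
# `generate` stands in for a video-LMM inference call; all names are illustrative.
from typing import Callable, Dict, List

import torch
import torch.nn.functional as F


def build_localized_pair(frames: List, segment: slice, query: str,
                         generate: Callable[[List, str], str]) -> Dict[str, str]:
    """Localized TPO: contrast a response produced with the target segment
    available against one produced after that segment is dropped."""
    with_segment = frames                                    # intact input
    without_segment = frames[:segment.start] + frames[segment.stop:]
    return {
        "query": query,
        "chosen": generate(with_segment, query),             # grounded in the segment
        "rejected": generate(without_segment, query),        # cannot see the segment
    }


def build_comprehensive_pair(frames: List, query: str, stride: int,
                             generate: Callable[[List, str], str]) -> Dict[str, str]:
    """Comprehensive TPO: contrast the intact video against a sparsely
    downsampled version of the same video."""
    sparse = frames[::stride]
    return {
        "query": query,
        "chosen": generate(frames, query),
        "rejected": generate(sparse, query),
    }


def dpo_loss(policy_chosen_logp: torch.Tensor,
             policy_rejected_logp: torch.Tensor,
             ref_chosen_logp: torch.Tensor,
             ref_rejected_logp: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Standard DPO objective on the curated pairs: push the policy toward
    the chosen (well-grounded) response relative to a frozen reference model."""
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()

In this standard preference-optimization setup, the chosen and rejected responses are scored by both the policy and a frozen reference copy of the video-LMM, and only the policy is updated, so the model learns to prefer temporally grounded responses without additional human annotation.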



Performance

Results of LongVA-TPO on the LongVideoBench, MLVU, and Video-MME benchmarks compared to three baseline methods: LongVA+SFT-Self, LongVA+SFT-LLM, and LongVA+Hound-DPO. The Video-MME results are presented in the format "w/o subs / w/ subs". The results for LongVA and LongVA+Hound-DPO are based on publicly available checkpoints, while the other results are evaluated using our trained models.

Results on the three benchmarks compared with the state of the art. The Video-MME results are presented in the format "w/o subs / w/ subs".

The performance of LongVA-TPO and LongVA on MLVU with different input lengths. LongVA-TPO consistently shows performance improvements with longer inputs, whereas LongVA experiences performance degradation when the input exceeds 64 frames.

Comparison of our LongVA-TPO model and LongVA on the needle-in-a-haystack task.

Examples of Temporal Preference Dataset

In Example (a), the task involves an OCR-based query aimed at retrieving the quote located beneath a mural. The dispreferred response incorrectly identifies the relevant frame, failing to locate the quote below the mural and instead referencing another frame containing the phrase “Forward, Warrior.” In contrast, the preferred response accurately identifies the corresponding frame based on the question. This is achieved by leveraging the highly relevant sub-video segment provided to the video-LMM, enabling the correct extraction of both the quote and its attribution.

Qualitative Analysis

In the first example, which involves a temporal localization and OCR task, our LongVA-TPO model demonstrates superior performance by accurately localizing the relevant information within the video and providing the correct answer to the OCR question. In the second example, a video discussing the Moon's formation, LongVA misinterprets the video content by relating it to the Earth's formation. In contrast, our LongVA-TPO model successfully comprehends and captures the key details of the video's content.

BibTeX

@misc{li2025temporalpreferenceoptimizationlongform,
      title={Temporal Preference Optimization for Long-Form Video Understanding}, 
      author={Rui Li and Xiaohan Wang and Yuhui Zhang and Zeyu Wang and Serena Yeung-Levy},
      year={2025},
      eprint={2501.13919},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2501.13919}, 
}