SeC: Advancing Complex Video Object Segmentation
via Progressive Concept Construction

Zhixiong Zhang* · Shuangrui Ding* · Xiaoyi Dong · Songxin He · Jianfan Lin
Junsong Tang · Yuhang Zang · Yuhang Cao · Dahua Lin · Jiaqi Wang
¹SJTU   ²CUHK   ³Shanghai AI Lab

Abstract

Video Object Segmentation (VOS) is a core task in computer vision, requiring models to track and segment target objects across video frames. Despite notable recent advances, current techniques still lag behind human capabilities in handling drastic visual variations, occlusions, and complex scene changes. This limitation arises from their reliance on appearance matching, which neglects the human-like conceptual understanding of objects that enables robust identification across temporal dynamics. Motivated by this gap, we propose Segment Concept (SeC), a concept-driven segmentation framework that shifts from conventional feature matching to the progressive construction and utilization of high-level, object-centric representations. SeC employs Large Vision-Language Models (LVLMs) to integrate visual cues across diverse frames, constructing robust conceptual priors. During inference, SeC forms a comprehensive semantic representation of the target from the frames processed so far and uses it to segment subsequent frames robustly. Furthermore, SeC adaptively balances LVLM-based semantic reasoning with enhanced feature matching, dynamically adjusting computational effort based on scene complexity. To rigorously assess VOS methods in scenarios demanding high-level conceptual reasoning and robust semantic understanding, we introduce the Semantic Complex Scenarios Video Object Segmentation benchmark (SeCVOS). SeCVOS comprises 160 manually annotated multi-scenario videos designed to challenge models with substantial appearance variations and dynamic scene transformations. Empirical evaluations demonstrate that SeC substantially outperforms state-of-the-art approaches, including SAM 2 and its advanced variants, on both SeCVOS and standard VOS benchmarks. In particular, SeC achieves an 11.8-point improvement over SAM 2.1 on SeCVOS, establishing a new state of the art in concept-aware video object segmentation.
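
The adaptive balance between feature matching and LVLM-based reasoning can be pictured with a toy routing loop. This is a minimal sketch under stated assumptions, not the SeC implementation: the function names `scene_change_score` and `route_frames`, the histogram-based change score, and the 0.3 threshold are all hypothetical and are not taken from the paper. Each frame is summarized as a normalized histogram; a frame that differs strongly from its predecessor is routed to the expensive concept path and kept as a keyframe, while all other frames take the cheap matching path.

```python
from typing import List, Sequence, Tuple

# Hypothetical per-frame summary: a normalized color histogram (entries sum to 1).
Histogram = Sequence[float]


def scene_change_score(prev: Histogram, curr: Histogram) -> float:
    """Total variation distance between two normalized histograms.

    Returns 0.0 for identical histograms and 1.0 for disjoint ones.
    """
    return 0.5 * sum(abs(a - b) for a, b in zip(prev, curr))


def route_frames(
    frames: List[Histogram], threshold: float = 0.3
) -> Tuple[List[str], List[int]]:
    """Label each frame with the path that would handle it.

    "prompt"  : the first frame, whose mask is given as the prompt
    "match"   : cheap feature matching against stored memory
    "concept" : the expensive concept-reasoning path (an LVLM call in SeC)

    Also returns the indices kept as keyframes for building the concept.
    The threshold is an illustrative value, not one from the paper.
    """
    if not frames:
        return [], []
    routes = ["prompt"]
    keyframes = [0]
    for i in range(1, len(frames)):
        if scene_change_score(frames[i - 1], frames[i]) > threshold:
            # Large change between consecutive frames: treat as a scene change.
            routes.append("concept")
            keyframes.append(i)
        else:
            routes.append("match")
    return routes, keyframes
```

Calling `route_frames` on a four-frame toy video whose third frame has a completely different histogram routes only that frame to the concept path, which illustrates how the costly reasoning step could be reserved for the frames that need it.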


SeCVOS

The large grilled pork rib held by the short-haired woman in professional attire.
A white soccer ball with a blue and purple pattern.
An orange Dodge Viper sports car.
A grey cat in a police uniform and hat, and a tabby cat in a blue shirt and large gold chain.
A circular red, white, and blue shield with a star.
A LEGO minifigure in a white helmet and a blue jacket holding a white sword.
A humanoid robot with red boxing gloves and headgear.
The violin held by the female performer in a white long dress.
A dark grey Aston Martin sports car.
A red can of Pringles potato chips.
The drum in front of the boy.
The small grey rabbit.

Main Result



BibTeX

  @article{zhang2025sec,
    title     = {SeC: Advancing Complex Video Object Segmentation via Progressive Concept Construction},
    author    = {Zhixiong Zhang and Shuangrui Ding and Xiaoyi Dong and Songxin He and Jianfan Lin and Junsong Tang and Yuhang Zang and Yuhang Cao and Dahua Lin and Jiaqi Wang},
    journal   = {arXiv preprint arXiv:2507.15852},
    year      = {2025}
  }