I am currently a Ph.D. student in Electronic Information at the School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen , supervised by Prof. Jianlong Wu. Prior to this, I completed seven years of undergraduate and master’s studies at Dalian University of Technology.

My current research interests focus on Multimodal Large Language Models, Multimodal Reasoning, Open World Object Detection and Segmentation, and Video Understanding & Temporal Analysis.

If you’re interested in my research or have any research-related questions, please feel free to contact me via email at zhangjinrong731@gmail.com.

📖 Education

  • 2025.09 - 2029.06, Harbin Institute of Technology, Shenzhen
    PhD in Electronic Information, School of Computer Science and Technology (Supervisor: Prof. Jianlong Wu)
  • 2022.09 - 2025.06, Dalian University of Technology
    Master in Control Science and Engineering, School of Control Science and Engineering
  • 2018.09 - 2022.06, Dalian University of Technology
    Bachelor in Transportation Engineering, School of Transportation and Logistics

💻 Internship Experience & Programs

  • Open World Object Detection & Segmentation: Research Intern, Xiaomi, 2024.03 - 2024.09
    • Developed an open-world detection and segmentation large model solution from scratch for the large die casting defect detection business at Xiaomi Automobile Factory. Outperformed SAM2 by achieving a 4.6% increase in AP50 on the proprietary validation set, with a significant >30% performance boost on long-tail categories.
    • Published a paper at AAAI based on the experimental research and findings conducted during the internship.
  • Complex Video Object Segmentation: CVPR Workshop, 5th PVUW Challenge, 2026.02 - 2026.03
    • The 1st Winner for 5th MOSE Challenge: Proposed tracking-enhanced prompt strategy to improve SAM3’s understanding capabilities for tiny and semantic-dominated objects, outperformed the 2nd place by 1.75% in J&F_new.
    • The 1st Winner for 5th MeViS-Text Challenge: Engineered an MLLM-guided SAM3-agent pipeline for iterative mask grounding and autonomous semantic refinement, outperformed the 2nd place by 7.91% in J&F.

📝 Publications

* Eauql contribution. # Corresponding author

CVPR 2026
sym

Breaking the Regional Perception Bottleneck of Multimodal Large Language Models via External Reasoning Framework

Jinrong Zhang, Zhaoyang Xu, Xusheng He, Xinrui Li, Na Zheng, Jianlong Wu#

First Author, CVPR (CCF-A), 2026

  • Revealed the limitations of current MLLMs in pixel-level regional perception and propose a multimodal MCTS-based reasoning framework to overcome this bottleneck, enabling a 7B model to achieve performance comparable to 72B.
AAAI 2025
sym

Just a Few Glances: Open-Set Visual Perception with Image Prompt Paradigm

Jinrong Zhang, Penghui Wang, Chunxiao Liu, Wei Liu, Dian Jin, Qiong Zhang#, Erli Meng#, Zhengnan Hu

First Author, AAAI (CCF-A), 2025

  • Introduced a novel visual prompts for open-world object detection and segmentation, which bypasses the inherent ambiguity of textual descriptions and achieves enhanced performance and robustness.
CVPR 2025
sym

DTOS: Dynamic Time Object Sensing with Large Multimodal Model

Jirui Tian*, Jinrong Zhang*, Shenglan Liu#, Luhao Xu, Zhixiong Huang, Gao Huang

Co-First Author, CVPR (CCF-A), 2025

  • Designed DTOS for precise spatio-temporal grounding in MLLMs, utilizing a two-stage temporal-spatial refinement mechanism and task-specific quantitative tokens to accurately locate discontinuous targets while overcoming information loss caused by uniform sampling and natural language ambiguity.
TNNLS 2025
sym

End-to-End Streaming Video Temporal Action Segmentation With Reinforcement Learning

JinRong Zhang, WuJun Wen, ShengLan Liu, Gao Huang, YunHeng Li, QiFeng Li, Lin Feng#

First Author, TNNLS (IF = 8.9), 2025

  • Pioneering a streaming video paradigm for temporal action segmentation unlocks ultra-long and online processing capabilities. Analyzing the intrinsic properties of streaming video and introducing a reinforcement learning paradigm successfully bridges the resulting feature gap.
ICME 2025
sym

Flexible Streaming Temporal Action Segmentation with Diffusion Models

Jinrong Zhang, Wujun Wen, Shenglan Liu, Sifan Zhang, Yuning Ding, Lin Feng#

First Author, ICME (CCF-B), 2025

  • Extending to dynamic streaming video scenarios to evaluate intrinsic feature variations across variable-length clips, and deploying a diffusion model architecture that fundamentally shifts the task paradigm to yield temporal robustness.

🗒️ Others

🎖 Achievements

  • Outstanding Graduate, Dalian University of Technology
  • International Underwater Robot Competition, Championship
  • The 19th RoboMaster Robotics Competition, Second Prize
  • China Robotics Competition (Underwater Robot Operations Project), Second Prize
  • Chinese Collegiate Computing Competition, Second Prize