I am currently a Ph.D. student in Electronic Information at the School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen , supervised by Prof. Jianlong Wu. Prior to this, I completed seven years of undergraduate and master’s studies at Dalian University of Technology.

My current research interests focus on Multimodal Large Language Models, Multimodal Reasoning, Open World Object Detection and Segmentation, and Video Understanding & Temporal Analysis.

If you’re interested in my research or have any research-related questions, please feel free to contact me via email at zhangjinrong731@gmail.com.

📖 Education

2025.09 - 2029.06, Harbin Institute of Technology, Shenzhen
PhD in Electronic Information, School of Computer Science and Technology (Supervisor: Prof. Jianlong Wu)
2022.09 - 2025.06, Dalian University of Technology
Master in Control Science and Engineering, School of Control Science and Engineering
2018.09 - 2022.06, Dalian University of Technology
Bachelor in Transportation Engineering, School of Transportation and Logistics

💻 Internship Experience & Programs

Open World Object Detection & Segmentation: Research Intern, Xiaomi, 2024.03 - 2024.09
- Developed an open-world detection and segmentation large model solution from scratch for the large die casting defect detection business at Xiaomi Automobile Factory. Outperformed SAM2 by achieving a 4.6% increase in AP50 on the proprietary validation set, with a significant >30% performance boost on long-tail categories.
- Published a paper at AAAI based on the experimental research and findings conducted during the internship.
Complex Video Object Segmentation: CVPR Workshop, 5th PVUW Challenge, 2026.02 - 2026.03
- The 1st Winner for 5th MOSE Challenge: Proposed tracking-enhanced prompt strategy to improve SAM3’s understanding capabilities for tiny and semantic-dominated objects, outperformed the 2nd place by 1.75% in J&F_new.
- The 1st Winner for 5th MeViS-Text Challenge: Engineered an MLLM-guided SAM3-agent pipeline for iterative mask grounding and autonomous semantic refinement, outperformed the 2nd place by 7.91% in J&F.

📝 Publications

* Eauql contribution. # Corresponding author

CVPR 2026

Breaking the Regional Perception Bottleneck of Multimodal Large Language Models via External Reasoning Framework

Jinrong Zhang, Zhaoyang Xu, Xusheng He, Xinrui Li, Na Zheng, Jianlong Wu#

First Author, CVPR (CCF-A), 2026

Revealed the limitations of current MLLMs in pixel-level regional perception and propose a multimodal MCTS-based reasoning framework to overcome this bottleneck, enabling a 7B model to achieve performance comparable to 72B.

AAAI 2025

Just a Few Glances: Open-Set Visual Perception with Image Prompt Paradigm

Jinrong Zhang, Penghui Wang, Chunxiao Liu, Wei Liu, Dian Jin, Qiong Zhang#, Erli Meng#, Zhengnan Hu

First Author, AAAI (CCF-A), 2025

Introduced a novel visual prompts for open-world object detection and segmentation, which bypasses the inherent ambiguity of textual descriptions and achieves enhanced performance and robustness.

CVPR 2025

DTOS: Dynamic Time Object Sensing with Large Multimodal Model

Jirui Tian*, Jinrong Zhang*, Shenglan Liu#, Luhao Xu, Zhixiong Huang, Gao Huang

Co-First Author, CVPR (CCF-A), 2025

Designed DTOS for precise spatio-temporal grounding in MLLMs, utilizing a two-stage temporal-spatial refinement mechanism and task-specific quantitative tokens to accurately locate discontinuous targets while overcoming information loss caused by uniform sampling and natural language ambiguity.

TNNLS 2025

End-to-End Streaming Video Temporal Action Segmentation With Reinforcement Learning

JinRong Zhang, WuJun Wen, ShengLan Liu, Gao Huang, YunHeng Li, QiFeng Li, Lin Feng#

First Author, TNNLS (IF = 8.9), 2025

Pioneering a streaming video paradigm for temporal action segmentation unlocks ultra-long and online processing capabilities. Analyzing the intrinsic properties of streaming video and introducing a reinforcement learning paradigm successfully bridges the resulting feature gap.

ICME 2025

Flexible Streaming Temporal Action Segmentation with Diffusion Models

Jinrong Zhang, Wujun Wen, Shenglan Liu, Sifan Zhang, Yuning Ding, Lin Feng#

First Author, ICME (CCF-B), 2025

Extending to dynamic streaming video scenarios to evaluate intrinsic feature variations across variable-length clips, and deploying a diffusion model architecture that fundamentally shifts the task paradigm to yield temporal robustness.

🗒️ Others

Multidimensional refinement graph convolutional network with robust decouple loss for fine-grained skeleton-based action recognition: Sheng-Lan Liu, Yu-Ning Ding, Jin-Rong Zhang, et al. TNNLS 2024.
2M-AF: A Strong Multi-Modality Framework for Human Action Quality Assessment with Self-Supervised Representation Learning: Yuning Ding, Sifan Zhang, Liu Shenglan, Jinrong Zhang, et al. ACM MM 2024.
Cluster-Refined Optimal Transport for Unsupervised Action Segmentation: Shijie Wang*, Jinrong Zhang*, et al. ICASSP 2025.
Unsupervised Temporal Action Segmentation Based on Wavelet Feature Processing: Xianghan Lin*, Jinrong Zhang*, et al. IJCNN 2025.

🎖 Achievements

Outstanding Graduate, Dalian University of Technology
International Underwater Robot Competition, Championship
The 19th RoboMaster Robotics Competition, Second Prize
China Robotics Competition (Underwater Robot Operations Project), Second Prize
Chinese Collegiate Computing Competition, Second Prize

Jinrong Zhang张津榕

📖 Education

💻 Internship Experience & Programs

📝 Publications

🗒️ Others

🎖 Achievements

Jinrong Zhang
张津榕