Hi there, I’m Muyi Bao (包沐亦)

I am currently an M.S. student in Electrical and Computer Engineering at Carnegie Mellon University. My research interests lie in Embodied AI and multimodal foundation models for robotics, focusing on building vision-language agents that can understand human instructions, interpret visual observations, and produce semantically aligned and physically executable behavior in embodied environments. In particular, I am interested in developing model designs and learning algorithms that improve the performance, efficiency, reliability, and safety of embodied agents.

At CMU, I am working with Dr. Ji Zhang and Dr. Wenshan Wang on Embodied AI, with a focus on Vision-and-Language Navigation. Before joining CMU, I received my B.Eng. degree in Computer Science and Technology from Xi’an Jiaotong-Liverpool University in 2025. During my undergraduate studies, I focused on computer vision and was fortunate to work with Prof. Guangliang Cheng, Prof. Wei Wang, and Prof. Ming Xu.

My resume can be found here (updated on 2025.08.10), and my email is muyib@andrew.cmu.edu.

I am actively looking for Ph.D. opportunities starting in Fall 2027, with research interests in Embodied AI and multimodal foundation models for robotics.

News

[2026/06] 📄 Two papers, Goal2Pixel and IntentNav, were submitted to CoRL 2026.
[2026/02] 📄 Our survey paper, Vision Mamba in Remote Sensing, was accepted by Remote Sensing.
[2025/11] 📄 NUMINA was accepted to Findings of EMNLP 2025.
[2025/08] 🎓 I joined Carnegie Mellon University as an M.S. student in Electrical and Computer Engineering.
[2025/07] 📄 FTCFormer was accepted by ECAI 2025.
[2025/06] 🎓 I received my B.Eng. degree in Computer Science and Technology from Xi'an Jiaotong-Liverpool University.
[2025/02] 📄 My first paper, AlexCapsNet, was accepted by IEEE Access.
[2024/12] 📄 One paper on Performance Analysis of Rendering Optimization was accepted by UIC 2024.
[2021/06] 🎓 I joined Xi'an Jiaotong-Liverpool University as a B.Eng. student in Computer Science and Technology.

Research Projects

	Goal2Pixel: Grounding Goals to Pixels for Vision-Language Navigation Muyi Bao^, Yuxin Cai^, Hang Xu, Zongtai Li, Jinxi He, Jingfan Tang, Chen Lv, Ji Zhang, Yaqi Xie, Wenshan Wang ^* Equal contribution arXiv, 2026 Goal2Pixel reformulates Vision-and-Language Navigation in Continuous Environments (VLN-CE) as a pure navigable-pixel grounding problem, using the image plane as a unified spatial interface between VLM reasoning and robot motion. For history representation, Goal2Pixel introduces Visibility-Aware Keyframe Memory (ViKeyMem), cutting training/inference time by around 50%. Goal2Pixel further incorporates semantic directive embeddings and coordinate-aware auxiliary losses to better adapt the VLM to VLN-CE. Goal2Pixel achieves 54.1% SR and 52.5% SPL on R2R-CE Val-Unseen with only 7.75 VLM calls per episode — 6× fewer than action prediction baselines. Project Page / Paper / Code
	IntentNav: Learning Spatial-Visual Object Navigation from Human Demonstrations Yuxin Cai^, Zongtai Li^, Maonan Wang, Muyi Bao, Haokun Zhu, Ruofei Bai, Ding Zhao, Zirui Li, Wenshan Wang, Wei-Yun Yau, Ji Zhang, Chen Lv ^* Equal contribution arXiv, 2026 IntentNav learns human-like ObjectNav policies from 23,767 human demonstrations by converting low-level trajectories into 2.36 million candidate-level waypoint supervision samples, without relying on oracle shortest paths. The model unifies frontier exploration and target commitment in a BEV-grounded spatial-visual decision space and achieves 53.8% SR on MP3D, 70.5% SR on HM3D-v1, and 82.2% SR on HM3D-v2. The same 2B VLM checkpoint further transfers zero-shot to wheeled, quadruped, and humanoid robots without additional VLM fine-tuning. Project Page
	Vision Mamba in Remote Sensing: A Comprehensive Survey of Techniques, Applications and Outlook Muyi Bao, Shuchang Lyu, Zhaoyang Xu, Huiyu Zhou, Jinchang Ren, Shiming Xiang, Xiangtai Li, Guangliang Cheng Remote Sensing, 2026 This is the first survey to systematically review more than 120 Mamba-based studies in remote sensing. It provides a structured overview of the field, covering preliminary knowledge, Vision Mamba backbones, micro- and macro-level architectural advancements, downstream applications, and key challenges and future research directions. Paper / Repository
	FTCFormer: Fuzzy Token Clustering Transformer for Image Classification Muyi Bao^, Changyu Zeng^, Yifan Wang, Zhengni Yang, Zimu Wang, Guangliang Cheng, Jun Qi, Wei Wang ^* Equal contribution European Conference on Artificial Intelligence (ECAI), 2025 FTCFormer introduces a semantic-aware, clustering-based token downsampling mechanism that dynamically allocates more tokens to informative image regions and fewer tokens to less important areas. Evaluated on 32 image classification datasets across diverse domains, FTCFormer consistently outperforms the TCFormer baseline, achieving average accuracy improvements of 1.43% on fine-grained datasets, 1.09% on natural image datasets, 0.97% on medical datasets, and 0.55% on remote sensing datasets. Paper / Code
	NUMINA: A Natural Understanding Benchmark for Multi-dimensional Intelligence and Numerical Reasoning Abilities Changyu Zeng^, Yifan Wang^, Zimu Wang^, Wei Wang, Zhengni Yang, Muyi Bao, Jiming Xiao, Anh Nguyen, Yutao Yue ^ Equal contribution Findings of EMNLP, 2025 NUMINA is a large-scale benchmark for evaluating fine-grained spatial understanding and numerical reasoning in 3D indoor environments. It contains 74,526 question-answer pairs across fact validation, prompt matching, and numerical inference tasks, with 62.0% of the questions involving quantity, distance, or volume reasoning. Human inspection of 20,000 samples achieves a 99.5% correctness rate, while evaluated models obtain below 3% TA@5 on precise distance and volume estimation, demonstrating the substantial limitations of current models in 3D numerical reasoning. Paper
	ASP-VMUNet: Atrous Shifted Parallel Vision Mamba U-Net for Skin Lesion Segmentation Muyi Bao, Shuchang Lyu, Zhaoyang Xu, Qi Zhao, Changyu Zeng, Wenpei Bai, Guangliang Cheng arXiv, 2025 ASP-VMUNet introduces a hybrid CNN-Mamba architecture for skin lesion segmentation, featuring: 1) an atrous scan strategy to suppress background interference and enlarge the receptive field; 2) a shift-round operation for cross-segment feature interaction; and 3) attention-based fusion of local and global features. Evaluated on PH2 and ISIC 2016-2018, ASP-VMUNet achieves state-of-the-art performance, outperforming the second-best method by 1.08%, 1.02%, 1.36%, and 0.21% in mIoU across the four datasets, respectively. Paper / Code
	Comparative Performance Analysis of Rendering Optimization Methods in Unity Tuanjie Engine, Unity Global and Unreal Engine Muyi Bao, Zeren Tao, Xiaohan Wang, Jiashuo Liu, Qilei Sun IEEE International Conference on Ubiquitous Intelligence and Computing (UIC), 2024 This work presents a systematic comparison of three geometry rendering optimization systems: Unity Global's Level of Detail, Tuanjie Engine's Virtual Geometry, and Unreal Engine 5's Nanite. Using benchmark scenes with ten 3D models containing three million polygons each, the study evaluates FPS, GPU, CPU, and memory usage across multiple viewing distances. The results show that Nanite achieves the best overall performance, while Tuanjie's Virtual Geometry outperforms Unity's LOD for distant objects but performs worse at close viewing distances. Paper / Repository
	AlexCapsNet: An Integrated Architecture for Image Classification with Background Noise Muyi Bao, Ming Xu, Nanlin Jin IEEE Access, 2025 AlexCapsNet integrates AlexNet-based deep feature extraction with CapsNet to improve classification on complex images with background noise. Across MNIST, Fashion-MNIST, SVHN, and CIFAR-10, it achieves accuracies of 99.66%, 93.27%, 95.33%, and 83.67%, outperforming the original CapsNet by an average of 5.16% accuracy. Experiments on seven datasets further show that removing CapsNet's reconstruction module improves robustness on datasets containing complex backgrounds. Paper / Code