Efficient and accurate 3D human pose estimation using structure-aware stride state space model
Recently, the Mamba architecture based on State Space Models (SSMs) has gained attention in 3D human pose estimation due to its linear complexity and strong global modeling capability. However, existing SSM-based methods typically rely on manually designed scan operations that flatten detected 2D pose sequences, either locally or globally, into purely temporal sequences. This flattening disrupts the inherent spatial structure of human poses and entangles spatial and temporal features, making it difficult to capture complex pose dependencies.
To address these limitations, we propose the Skeleton Structure-Aware Stride SSM (SAS-SSM), which first employs a structure-aware spatiotemporal convolution to dynamically capture essential local interactions between joints, and then applies a stride-based scan strategy to construct multi-scale global structural representations. This enables flexible modeling of both local and global pose information while maintaining linear computational complexity.
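To make the stride-based scan idea concrete, here is a minimal numpy sketch (an illustrative assumption, not the paper's exact implementation): given pose tokens arranged as frames × joints × channels, a stride-s scan splits the clip into s interleaved subsequences. Each subsequence spans the whole clip at 1/s temporal resolution while keeping every frame's joint layout intact, so a scan over each subsequence sees global context at linear total cost.

```python
import numpy as np

def stride_scan(tokens: np.ndarray, stride: int) -> list:
    """Split a (T, J, C) pose token tensor into `stride` interleaved
    subsequences; subsequence k takes frames k, k+s, k+2s, ... and
    flattens joints in order, preserving each frame's spatial layout."""
    T, J, C = tokens.shape
    assert T % stride == 0, "frame count must be divisible by the stride"
    return [tokens[k::stride].reshape(-1, C) for k in range(stride)]

# Toy example: 8 frames, 17 joints (COCO-style), 4 channels.
T, J, C = 8, 17, 4
tokens = np.arange(T * J * C, dtype=np.float32).reshape(T, J, C)
subseqs = stride_scan(tokens, stride=2)
print(len(subseqs), subseqs[0].shape)  # 2 subsequences, each (T//2 * J, C)
```

Varying the stride yields the multi-scale global representations described above: small strides preserve fine temporal detail, while large strides give each scan a coarser but longer-range view of the motion.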
Built upon SAS-SSM, our model SasMamba achieves competitive 3D pose estimation performance with significantly fewer parameters compared to existing hybrid models.
This figure illustrates the key difference between traditional flattened scan approaches and our proposed structure-aware scan strategy. The upper part shows how existing methods flatten 2D pose sequences into purely temporal sequences, which disrupts the inherent spatial structure of human poses. The lower part demonstrates our SAS-SSM approach, which maintains the spatial relationships between joints while enabling efficient multi-scale global structural representations.
The overall framework of SasMamba follows a streamlined pipeline designed for efficient 3D pose estimation. The process begins with input preprocessing, followed by SasMamba block processing, and concludes with output generation.
A fully connected layer projects the input keypoint sequence into a high-dimensional space, followed by positional and temporal embeddings.
The processed sequence is fed into SasMamba blocks, which apply structure-aware spatiotemporal convolution and multi-scale, multi-directional scan.
After fusing local and global multi-scale topological features, the final 3D pose is regressed, ensuring both local accuracy and global smoothness.
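The pipeline above can be sketched at the shape level with numpy (all dimensions and initializations here are illustrative assumptions, not the paper's configuration; the SasMamba blocks are stood in by an identity so only the data flow is shown):

```python
import numpy as np

rng = np.random.default_rng(0)
T, J, C = 243, 17, 64                        # frames, joints, embed channels

x = rng.standard_normal((T, J, 2))           # detected 2D keypoints (x, y)

# 1) Fully connected projection into a high-dimensional space.
W_in = rng.standard_normal((2, C)) * 0.02
h = x @ W_in                                 # (T, J, C)

# 2) Add positional (per-joint) and temporal (per-frame) embeddings,
#    broadcast over the other axis.
pos_emb = rng.standard_normal((1, J, C)) * 0.02
tmp_emb = rng.standard_normal((T, 1, C)) * 0.02
h = h + pos_emb + tmp_emb

# 3) SasMamba blocks (structure-aware spatiotemporal convolution plus
#    multi-scale, multi-directional stride scans) would transform h here;
#    an identity stands in for this sketch.

# 4) Regression head maps each joint token to a 3D coordinate.
W_out = rng.standard_normal((C, 3)) * 0.02
pose3d = h @ W_out                           # (T, J, 3)
print(pose3d.shape)
```

The sequence-to-sequence design means every input frame gets an output pose, which is why per-frame MACs stay low for long clips.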
| Method | Venue | Seq2Seq | T | Params | MACs | MACs/frame | P1(mm) ↓ | P2(mm) ↓ | P1(mm)† ↓ |
|---|---|---|---|---|---|---|---|---|---|
| Small/Lightweight Models | |||||||||
| P-STMO | ECCV'22 | ✗ | 243 | 6.2M | 0.7G | 3M | 42.8 | 34.4 | 29.3 |
| STCFormer | CVPR'23 | ✓ | 243 | 4.7M | 19.6G | 80M | 41.0 | 32.0 | 22.0 |
| GLA-GCN | ICCV'23 | ✗ | 243 | 1.3M | 1.5G | 6M | 44.4 | 34.8 | 21.0 |
| HDFormer | IJCAI'23 | ✓ | 96 | 3.7M | 0.6G | 6M | 42.6 | 33.1 | 21.6 |
| MotionAGFormer-XS | WACV'24 | ✓ | 27 | 2.2M | 1.0G | 37M | 45.1 | 36.9 | 28.1 |
| MotionAGFormer-S | WACV'24 | ✓ | 81 | 4.8M | 6.6G | 81M | 42.5 | 35.3 | 26.5 |
| HGMamba-XS | IJCNN'25 | ✓ | 27 | 2.8M | 1.14G | 42M | 44.9 | 38.3 | 29.5 |
| HGMamba-S | IJCNN'25 | ✓ | 81 | 6.1M | 8.02G | 99M | 42.8 | 35.9 | 22.9 |
| PoseMamba-S | AAAI'25 | ✓ | 243 | 0.9M | 3.6G | 15M | 41.8 | 35.0 | 22.0* |
| SasMamba | - | ✓ | 243 | 0.64M | 1.3G | 5M | 41.48 | 34.84 | 21.44 |
| Larger Models (Reference) | |||||||||
| MixSTE | CVPR'22 | ✓ | 243 | 33.6M | 139.0G | 572M | 40.9 | 32.6 | 21.6 |
| PoseFormerV2 | CVPR'23 | ✗ | 243 | 14.3M | 128.2G | 528M | 45.2 | 35.6 | - |
| MotionBERT | ICCV'23 | ✓ | 243 | 42.3M | 174.8G | 719M | 39.2 | 32.9 | 17.8 |
| KTPFormer | CVPR'24 | ✓ | 243 | 33.7M | 69.5G | 286M | 40.1 | 31.9 | 19.0 |
| MotionAGFormer-B | WACV'24 | ✓ | 243 | 11.7M | 48.3G | 198M | 38.4 | 32.6 | 19.4 |
| MotionAGFormer-L | WACV'24 | ✓ | 243 | 19.0M | 78.3G | 322M | 38.4 | 32.5 | 17.4 |
| PoseMamba-B | AAAI'25 | ✓ | 243 | 3.4M | 13.9G | 57M | 40.8 | 34.3 | 16.8 |
| PoseMamba-L | AAAI'25 | ✓ | 243 | 6.7M | 27.9G | 115M | 38.1 | 32.5 | 15.6 |
| SasMamba-large | - | ✓ | 243 | 4.1M | 8.56G | 35M | 39.77 | 33.61 | 20.92 |
Legend: T = Number of input frames. Seq2Seq = Sequence-to-sequence estimation. MACs/frame = Multiply-accumulate operations per output frame. P1 = MPJPE error (mm). P2 = P-MPJPE error (mm). P1† = P1 error when using ground-truth 2D keypoints as input. Bold = Best result, Underlined = Second-best result. * = Reproduced results.
Qualitative comparisons of our proposed SasMamba with PoseMamba-S, MotionAGFormer-S, and HGMamba-S on 3D human pose estimation. The solid purple skeletons represent the ground-truth 3D poses, while the dashed green skeletons indicate the predicted 3D poses.
Qualitative comparisons with PoseMamba-S, MotionAGFormer-S, and HGMamba-S on challenging in-the-wild videos. Red arrows indicate accurate estimations, while gray arrows highlight unsatisfactory estimations.
@inproceedings{sasmamba2024,
  title={SasMamba: A Lightweight Structure-Aware Stride State Space Model for 3D Human Pose Estimation},
  author={Author1 and Author2 and Author3},
  booktitle={WACV},
  year={2025}
}