Efficient and accurate 3D human pose estimation using structure-aware stride state space model
Recently, the Mamba architecture based on State Space Models (SSMs) has gained attention in 3D human pose estimation due to its linear complexity and strong global modeling capability. However, existing SSM-based methods typically rely on manually designed scan operations that flatten detected 2D pose sequences, either locally or globally, into purely temporal sequences. This flattening disrupts the inherent spatial structure of human poses and entangles spatial and temporal features, making it difficult to capture complex pose dependencies.
To address these limitations, we propose the Skeleton Structure-Aware Stride SSM (SAS-SSM), which first employs a structure-aware spatiotemporal convolution to dynamically capture essential local interactions between joints, and then applies a stride-based scan strategy to construct multi-scale global structural representations. This enables flexible modeling of both local and global pose information while maintaining linear computational complexity.
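To make the stride-based scan idea concrete, here is a minimal numpy sketch (an illustrative assumption, not the paper's exact implementation): given pose tokens arranged as frames × joints × channels, a stride-s scan splits the clip into s interleaved subsequences. Each subsequence spans the whole clip at 1/s temporal resolution while keeping every frame's joint layout intact, so a scan over each subsequence sees global context at linear total cost.

```python
import numpy as np

def stride_scan(tokens: np.ndarray, stride: int) -> list:
    """Split a (T, J, C) pose token tensor into `stride` interleaved
    subsequences; subsequence k takes frames k, k+s, k+2s, ... and
    flattens joints in order, preserving each frame's spatial layout."""
    T, J, C = tokens.shape
    assert T % stride == 0, "frame count must be divisible by the stride"
    return [tokens[k::stride].reshape(-1, C) for k in range(stride)]

# Toy example: 8 frames, 17 joints (COCO-style), 4 channels.
T, J, C = 8, 17, 4
tokens = np.arange(T * J * C, dtype=np.float32).reshape(T, J, C)
subseqs = stride_scan(tokens, stride=2)
print(len(subseqs), subseqs[0].shape)  # 2 subsequences, each (T//2 * J, C)
```

Varying the stride yields the multi-scale global representations described above: small strides preserve fine temporal detail, while large strides give each scan a coarser but longer-range view of the motion.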
Built upon SAS-SSM, our model SasMamba achieves competitive 3D pose estimation performance with significantly fewer parameters compared to existing hybrid models.
This figure illustrates the key difference between traditional flattened scan approaches and our proposed structure-aware scan strategy. The upper part shows how existing methods flatten 2D pose sequences into purely temporal sequences, which disrupts the inherent spatial structure of human poses. The lower part demonstrates our SAS-SSM approach, which maintains the spatial relationships between joints while enabling efficient multi-scale global structural representations.
The overall framework of SasMamba follows a streamlined pipeline designed for efficient 3D pose estimation. The process begins with input preprocessing, followed by SasMamba block processing, and concludes with output generation.
A fully connected layer projects the input keypoint sequence into a high-dimensional space, followed by positional and temporal embeddings.
The processed sequence is fed into SasMamba blocks, which apply structure-aware spatiotemporal convolution and multi-scale, multi-directional scan.
After fusing local and global multi-scale topological features, the final 3D pose is regressed, ensuring both local accuracy and global smoothness.
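The pipeline above can be sketched at the shape level with numpy (all dimensions and initializations here are illustrative assumptions, not the paper's configuration; the SasMamba blocks are stood in by an identity so only the data flow is shown):

```python
import numpy as np

rng = np.random.default_rng(0)
T, J, C = 243, 17, 64                        # frames, joints, embed channels

x = rng.standard_normal((T, J, 2))           # detected 2D keypoints (x, y)

# 1) Fully connected projection into a high-dimensional space.
W_in = rng.standard_normal((2, C)) * 0.02
h = x @ W_in                                 # (T, J, C)

# 2) Add positional (per-joint) and temporal (per-frame) embeddings,
#    broadcast over the other axis.
pos_emb = rng.standard_normal((1, J, C)) * 0.02
tmp_emb = rng.standard_normal((T, 1, C)) * 0.02
h = h + pos_emb + tmp_emb

# 3) SasMamba blocks (structure-aware spatiotemporal convolution plus
#    multi-scale, multi-directional stride scans) would transform h here;
#    an identity stands in for this sketch.

# 4) Regression head maps each joint token to a 3D coordinate.
W_out = rng.standard_normal((C, 3)) * 0.02
pose3d = h @ W_out                           # (T, J, 3)
print(pose3d.shape)
```

The sequence-to-sequence design means every input frame gets an output pose, which is why per-frame MACs stay low for long clips.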
| Method | Venue | Seq2Seq | T | Params | MACs | MACs/frame | P1(mm) ↓ | P2(mm) ↓ | P1(mm)† ↓ |
|---|---|---|---|---|---|---|---|---|---|
| Small/Lightweight Models | |||||||||
| P-STMO | ECCV'22 | ✗ | 243 | 6.2M | 0.7G | 3M | 42.8 | 34.4 | 29.3 |
| STCFormer | CVPR'23 | ✓ | 243 | 4.7M | 19.6G | 80M | 41.0 | 32.0 | 22.0 |
| GLA-GCN | ICCV'23 | ✗ | 243 | 1.3M | 1.5G | 6M | 44.4 | 34.8 | 21.0 |
| HDFormer | IJCAI'23 | ✓ | 96 | 3.7M | 0.6G | 6M | 42.6 | 33.1 | 21.6 |
| MotionAGFormer-XS | WACV'24 | ✓ | 27 | 2.2M | 1.0G | 37M | 45.1 | 36.9 | 28.1 |
| MotionAGFormer-S | WACV'24 | ✓ | 81 | 4.8M | 6.6G | 81M | 42.5 | 35.3 | 26.5 |
| HGMamba-XS | IJCNN'25 | ✓ | 27 | 2.8M | 1.14G | 42M | 44.9 | 38.3 | 29.5 |
| HGMamba-S | IJCNN'25 | ✓ | 81 | 6.1M | 8.02G | 99M | 42.8 | 35.9 | 22.9 |
| PoseMamba-S | AAAI'25 | ✓ | 243 | 0.9M | 3.6G | 15M | 41.8 | 35.0 | 22.0* |
| SasMamba | - | ✓ | 243 | 0.64M | 1.3G | 5M | 41.48 | 34.84 | 21.44 |
| Larger Models (Reference) | |||||||||
| MixSTE | CVPR'22 | ✓ | 243 | 33.6M | 139.0G | 572M | 40.9 | 32.6 | 21.6 |
| PoseFormerV2 | CVPR'23 | ✗ | 243 | 14.3M | 128.2G | 528M | 45.2 | 35.6 | - |
| MotionBERT | ICCV'23 | ✓ | 243 | 42.3M | 174.8G | 719M | 39.2 | 32.9 | 17.8 |
| KTPFormer | CVPR'24 | ✓ | 243 | 33.7M | 69.5G | 286M | 40.1 | 31.9 | 19.0 |
| MotionAGFormer-B | WACV'24 | ✓ | 243 | 11.7M | 48.3G | 198M | 38.4 | 32.6 | 19.4 |
| MotionAGFormer-L | WACV'24 | ✓ | 243 | 19.0M | 78.3G | 322M | 38.4 | 32.5 | 17.4 |
| PoseMamba-B | AAAI'25 | ✓ | 243 | 3.4M | 13.9G | 57M | 40.8 | 34.3 | 16.8 |
| PoseMamba-L | AAAI'25 | ✓ | 243 | 6.7M | 27.9G | 115M | 38.1 | 32.5 | 15.6 |
| SasMamba-large | - | ✓ | 243 | 4.1M | 8.56G | 35M | 39.77 | 33.61 | 20.92 |
Legend: T = Number of input frames. Seq2Seq = Sequence-to-sequence estimation. MACs/frame = Multiply-accumulate operations per output frame. P1 = MPJPE error (mm). P2 = P-MPJPE error (mm). P1† = P1 error when using ground-truth 2D keypoints as input. Bold = Best result, Underlined = Second-best result. * = Reproduced results.
Qualitative comparisons of our proposed SasMamba with PoseMamba-S, MotionAGFormer-S, and HGMamba-S on 3D human pose estimation. The solid purple skeletons represent the ground-truth 3D poses, while the dashed green skeletons indicate the predicted 3D poses.
Qualitative comparisons with PoseMamba-S, MotionAGFormer-S, and HGMamba-S on challenging in-the-wild videos. Red arrows indicate accurate estimations, while gray arrows highlight unsatisfactory estimations.
@inproceedings{sasmamba2024,
  title={SasMamba: A Lightweight Structure-Aware Stride State Space Model for 3D Human Pose Estimation},
  author={Author1 and Author2 and Author3},
  booktitle={WACV},
  year={2025}
}