SasMamba: A Lightweight Structure-Aware Stride State Space Model for 3D Human Pose Estimation

Efficient and accurate 3D human pose estimation using a structure-aware stride state space model

Abstract

Recently, the Mamba architecture, built on State Space Models (SSMs), has gained attention in 3D human pose estimation thanks to its linear complexity and strong global modeling capability. However, existing SSM-based methods typically apply manually designed scan operations that flatten the detected 2D pose sequence, either locally or globally, into a purely temporal sequence. This disrupts the inherent spatial structure of human poses and entangles spatial and temporal features, making it difficult to capture complex pose dependencies.

To address these limitations, we propose the Skeleton Structure-Aware Stride SSM (SAS-SSM), which first employs a structure-aware spatiotemporal convolution to dynamically capture essential local interactions between joints, and then applies a stride-based scan strategy to construct multi-scale global structural representations. This enables flexible modeling of both local and global pose information while maintaining linear computational complexity.
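To make the two stages concrete, the following is a minimal sketch of the idea, not the paper's implementation: the (B, T, J, C) tensor layout, the depthwise 3x3 convolution, and the stride set (1, 2, 4) are all illustrative assumptions.

```python
import torch
import torch.nn as nn

# Hedged sketch of the two SAS-SSM stages; layout and strides are assumptions.
class SASSSMSketch(nn.Module):
    def __init__(self, channels: int, strides=(1, 2, 4)):
        super().__init__()
        # Stage 1: structure-aware spatiotemporal convolution, i.e. a local
        # filter over the (time, joint) grid capturing inter-joint interactions.
        self.local_conv = nn.Conv2d(channels, channels, kernel_size=3,
                                    padding=1, groups=channels)
        self.strides = strides

    def forward(self, x: torch.Tensor):       # x: (B, T, J, C)
        B, T, J, C = x.shape
        h = self.local_conv(x.permute(0, 3, 1, 2)).permute(0, 2, 3, 1)
        # Stage 2: stride-based scan. Subsampling frames at several strides
        # yields sequences at different temporal scales, while the joint axis
        # (the skeleton structure) is encoded first and never shuffled away.
        return [h[:, ::s].reshape(B, -1, C) for s in self.strides]

scans = SASSSMSketch(64)(torch.randn(2, 243, 17, 64))  # three multi-scale scans
```

Each strided sequence would then pass through an SSM layer, so the model sees the pose sequence at several temporal scales while the per-frame joint layout is already encoded before any flattening.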

Built upon SAS-SSM, our model SasMamba achieves competitive 3D pose estimation performance with significantly fewer parameters than existing hybrid models.

Method Introduction

Comparison of Pose Sequence Processing: Flattened Scan vs. Structure-Aware Scan

This figure illustrates the key difference between traditional flattened scan approaches and our proposed structure-aware scan strategy. The upper part shows how existing methods flatten 2D pose sequences into purely temporal sequences, which disrupts the inherent spatial structure of human poses. The lower part demonstrates our SAS-SSM approach, which maintains the spatial relationships between joints while enabling efficient multi-scale global structural representations.
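A toy example (ours, not from the paper) of why flattening entangles the two axes: whichever axis is flattened away, neighbors along it are pushed apart in the resulting sequence.

```python
import numpy as np

# Flatten a (T, J) pose grid frame-major: temporal neighbors of the same
# joint end up J positions apart in the purely temporal sequence.
T, J = 4, 3
grid = np.arange(T * J).reshape(T, J)  # (frame, joint) -> scan position
flat = grid.reshape(-1)                # flattened, purely temporal sequence
print(flat[0], flat[J])  # joint 0 at frames 0 and 1 are now J=3 steps apart
```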

Overall Framework of SasMamba

Architecture Overview

The overall framework of SasMamba follows a streamlined pipeline designed for efficient 3D pose estimation: input preprocessing, SasMamba block processing, and output generation. A minimal sketch of the pipeline follows the steps below.

1. Input Preprocessing: A fully connected layer projects the input 2D keypoint sequence into a high-dimensional space, after which positional and temporal embeddings are added.

2. SasMamba Block Processing: The embedded sequence is fed into stacked SasMamba blocks, each applying a structure-aware spatiotemporal convolution followed by multi-scale, multi-directional scans.

3. Output Generation: Local and global multi-scale topological features are fused and the final 3D pose sequence is estimated, ensuring both local accuracy and global smoothness.
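A minimal sketch of these three stages in PyTorch; the joint count, frame count, embedding width, and block count are illustrative assumptions, and nn.Identity stands in for the actual SasMamba block described above.

```python
import torch
import torch.nn as nn

# Hedged sketch of the SasMamba pipeline; dimensions are made up.
class SasMambaPipelineSketch(nn.Module):
    def __init__(self, n_joints=17, n_frames=243, dim=64, n_blocks=6):
        super().__init__()
        self.embed = nn.Linear(2, dim)                         # step 1: projection
        self.pos_embed = nn.Parameter(torch.zeros(1, 1, n_joints, dim))
        self.time_embed = nn.Parameter(torch.zeros(1, n_frames, 1, dim))
        self.blocks = nn.ModuleList(nn.Identity()              # step 2: blocks
                                    for _ in range(n_blocks))  # (placeholders)
        self.head = nn.Linear(dim, 3)                          # step 3: 3D output

    def forward(self, x):                     # x: (B, T, J, 2) 2D keypoints
        h = self.embed(x) + self.pos_embed + self.time_embed
        for block in self.blocks:
            h = block(h)
        return self.head(h)                   # (B, T, J, 3), sequence-to-sequence

poses3d = SasMambaPipelineSketch()(torch.randn(2, 243, 17, 2))
```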

Method Comparison

Quantitative Results on Human3.6M Dataset

| Method | Venue | T | Params | MACs | MACs/frame | P1 (mm) ↓ | P2 (mm) ↓ | P1† (mm) ↓ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| **Small/Lightweight Models** | | | | | | | | |
| P-STMO | ECCV'22 | 243 | 6.2M | 0.7G | 3M | 42.8 | 34.4 | 29.3 |
| STCFormer | CVPR'23 | 243 | 4.7M | 19.6G | 80M | 41.0 | 32.0 | 22.0 |
| GLA-GCN | ICCV'23 | 243 | 1.3M | 1.5G | 6M | 44.4 | 34.8 | 21.0 |
| HDFormer | IJCAI'23 | 96 | 3.7M | 0.6G | 6M | 42.6 | 33.1 | 21.6 |
| MotionAGFormer-XS | WACV'24 | 27 | 2.2M | 1.0G | 37M | 45.1 | 36.9 | 28.1 |
| MotionAGFormer-S | WACV'24 | 81 | 4.8M | 6.6G | 81M | 42.5 | 35.3 | 26.5 |
| HGMamba-XS | IJCNN'25 | 27 | 2.8M | 1.14G | 42M | 44.9 | 38.3 | 29.5 |
| HGMamba-S | IJCNN'25 | 81 | 6.1M | 8.02G | 99M | 42.8 | 35.9 | 22.9 |
| PoseMamba-S | AAAI'25 | 243 | 0.9M | 3.6G | 15M | 41.8 | 35.0 | 22.0* |
| SasMamba | - | 243 | 0.64M | 1.3G | 5M | 41.48 | 34.84 | 21.44 |
| **Larger Models (Reference)** | | | | | | | | |
| MixSTE | CVPR'22 | 243 | 33.6M | 139.0G | 572M | 40.9 | 32.6 | 21.6 |
| PoseFormerV2 | CVPR'23 | 243 | 14.3M | 128.2G | 528M | 45.2 | 35.6 | - |
| MotionBERT | ICCV'23 | 243 | 42.3M | 174.8G | 719M | 39.2 | 32.9 | 17.8 |
| KTPFormer | CVPR'24 | 243 | 33.7M | 69.5G | 286M | 40.1 | 31.9 | 19.0 |
| MotionAGFormer-B | WACV'24 | 243 | 11.7M | 48.3G | 198M | 38.4 | 32.6 | 19.4 |
| MotionAGFormer-L | WACV'24 | 243 | 19.0M | 78.3G | 322M | 38.4 | 32.5 | 17.4 |
| PoseMamba-B | AAAI'25 | 243 | 3.4M | 13.9G | 57M | 40.8 | 34.3 | 16.8 |
| PoseMamba-L | AAAI'25 | 243 | 6.7M | 27.9G | 115M | 38.1 | 32.5 | 15.6 |
| SasMamba-large | - | 243 | 4.1M | 8.56G | 35M | 39.77 | 33.61 | 20.92 |

Legend: T = number of input frames. Params = model parameter count. MACs/frame = multiply-accumulate operations per output frame. P1 = MPJPE error (mm). P2 = P-MPJPE error (mm). P1† = P1 error with 2D ground-truth inputs. * = reproduced results.
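As a quick sanity check of the MACs/frame column (our arithmetic, not from the paper), divide the total MACs by the T output frames for these sequence-to-sequence models:

```python
# MACs/frame = total MACs / number of output frames (T).
total_macs, t_frames = 1.3e9, 243      # SasMamba row of the table
print(f"{total_macs / t_frames / 1e6:.1f}M MACs/frame")  # ~5.3M, listed as 5M
```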

Main Visualization

3D Pose Estimation Result 1

Detailed comparison of the estimation results of various lightweight networks on the Human3.6M evaluation set.

Qualitative comparisons of our proposed SasMamba with PoseMamba-S, MotionAGFormer-S, and HGMamba-S on 3D human pose estimation. The solid purple skeletons represent the ground-truth 3D poses, while the dashed green skeletons indicate the predicted 3D poses.

3D Pose Estimation Result 2

Detailed comparison of the estimation results of various lightweight networks on in-the-wild videos.

Qualitative comparisons with PoseMamba-S, MotionAGFormer-S, and HGMamba-S on challenging in-the-wild videos. Red arrows indicate accurate estimations, while gray arrows highlight unsatisfactory estimations.

Video Demonstration

Real-time 3D Pose Estimation: Sample 1

Real-time 3D Pose Estimation: Sample 2

Real-time 3D Pose Estimation: Sample 3

Real-time 3D Pose Estimation: Sample 4

Video Comparisons (Samples 1–5)

Each sample shows a side-by-side, frame-aligned comparison of SasMamba (ours) with PoseMamba-S, MotionAGFormer-S, and HGMamba-S.

Paper Information

SasMamba: A Lightweight Structure-Aware Stride State Space Model for 3D Human Pose Estimation

Author1, Author2, Author3

Conference Name, 2025

BibTeX

@inproceedings{sasmamba2025,
  title={SasMamba: A Lightweight Structure-Aware Stride State Space Model for 3D Human Pose Estimation},
  author={Author1 and Author2 and Author3},
  booktitle={Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)},
  year={2025}
}