ResGCN
Stronger, Faster and More Explainable: A Graph Convolutional Baseline for Skeleton-based Action Recognition
Song2020ResGCN
Introduction
Many previously proposed methods demand a very large number of parameters and a lot of memory. For example, 2s-AGCN uses 6.94 million parameters and 4 GPUs when training on NTU RGB+D 60, and DGNN requires 26 million parameters. Existing methods thus consume a substantial amount of system resources.
To overcome this, the authors combine several well-known techniques into a GCN-based baseline network:
- an early fused Multiple Input Branches (MIB) architecture is proposed to capture rich spatial configurations and temporal dynamics from skeleton data, where the three branches include joint positions (relative and absolute), bone features (lengths and angles) and motion velocities (one or two temporal steps) respectively,
- we introduce a Residual GCN (ResGCN) module, where the residual links ease model optimization.
- a Part-wise Attention (PartAtt) module is proposed to discover the most essential body parts over a whole action sequence.
Limitations:
- The current SOTA models are often exceedingly sophisticated and over-parameterized
- Too expensive for training and testing
Contributions:
- An early fused multi-branch architecture is designed to take inputs from three individual spatio-temporal feature sequences (Joint, Velocity and Bone) obtained from raw skeleton data, which enables the baseline model to extract sufficient structural features.
- To further enhance the efficiency of our model, a residual bottleneck structure is introduced in GCN, where the residual links reduce the difficulties in model training and the bottleneck structure reduces the computational costs in parameter tuning and model inference.
- A part-wise attention block is proposed to compute attention weights for different human body parts to further improve the discriminative capability of the features, which meanwhile provides an explanation for the classification results through visualizing the class activation maps.
- Extensive experiments are conducted on two large-scale skeleton action datasets, i.e., NTU RGB+D 60 and 120, where the PA-ResGCN can achieve the SOTA performance, and the ResGCN with bottleneck structure obtains competitive performance with much fewer parameters.
Method
System Architecture Overview
1. Joints, Velocities, and Bones are each fed into a separate input branch that extracts initial features.
2. The branch outputs are fused (concatenated) and fed into the main-stream model.
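The two steps above can be sketched in NumPy for shape intuition only; the branch function, channel sizes, and names below are my own placeholders, not the authors' code (each real branch is a stack of graph-convolution blocks):

```python
import numpy as np

C, T, V = 3, 100, 25  # coordinate channels (x, y, z), frames, joints

def branch(x, out_channels=32):
    """Stand-in for a per-branch feature extractor: a random 1x1
    channel projection, used here only to illustrate the shapes."""
    w = np.random.randn(out_channels, x.shape[0])
    return np.einsum('oc,ctv->otv', w, x)

joints   = np.random.randn(C * 2, T, V)  # relative + absolute positions
velocity = np.random.randn(C * 2, T, V)  # one- and two-step differences
bones    = np.random.randn(C * 2, T, V)  # lengths + angles

# Early fusion: concatenate branch outputs along the channel axis,
# then feed the result to the shared main-stream model.
fused = np.concatenate(
    [branch(joints), branch(velocity), branch(bones)], axis=0)
print(fused.shape)  # (96, 100, 25)
```

Fusing early (before the main stream) is what keeps the parameter count low compared with late-fusion multi-stream models such as 2s-AGCN.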
Data Preprocessing
$x \in \mathbb{R}^{C \times T \times V}$
1) Joint positons
$\mathcal{R} = \{ r_{i} \mid i = 1, 2, \dots, V \}$
where $r_{i} = x[:,:,i] - x[:,:,c]$
and $x[:,:,c]$ is the center joint.
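A minimal sketch of the relative-position input, assuming $x$ has shape (C, T, V) and that joint index 1 is the body center (the actual center index depends on the dataset's skeleton layout):

```python
import numpy as np

C, T, V, c = 3, 4, 25, 1      # c = assumed center joint index
x = np.random.randn(C, T, V)

# r_i = x[:,:,i] - x[:,:,c]; broadcasting subtracts the center joint
# from every joint at once.
r = x - x[:, :, c:c+1]
```

After this step the center joint becomes the origin, so `r[:, :, c]` is all zeros.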
2) Motion velocity
$\mathcal{F} = \{ f_{t} \mid t = 1, 2, \dots, T \}$
where $f_{t} = x[:,t+2,:] - x[:,t,:]$
$\mathcal{S} = \{ s_{t} \mid t = 1, 2, \dots, T \}$
where $s_{t} = x[:,t+1,:] - x[:,t,:]$
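A sketch of the two velocity streams as frame differences; padding the tail with zeros so the temporal length stays $T$ is my assumption, not something the note specifies:

```python
import numpy as np

C, T, V = 3, 10, 25
x = np.random.randn(C, T, V)

# One-step (s_t) and two-step (f_t) temporal differences,
# zero-padded at the end to keep the length T.
one_step = np.zeros_like(x)
two_step = np.zeros_like(x)
one_step[:, :T-1] = x[:, 1:] - x[:, :-1]   # s_t = x[:,t+1,:] - x[:,t,:]
two_step[:, :T-2] = x[:, 2:] - x[:, :-2]   # f_t = x[:,t+2,:] - x[:,t,:]
```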
3) Bone length and angle
3-1) length
$\mathcal{L} = \{ l_{i} \mid i = 1, 2, \dots, V \}$
where $l_{i} = x[:,:,i] - x[:,:,i_{adj}]$
(It is unclear to me, though, whether $x[:,:,i_{adj}]$ denotes all adjacent joints or a single one.)
3-2) angle
$\mathcal{A} = \{ a_{i} \mid i = 1, 2, \dots, V \}$
where $a_{i,w} = \arccos\left(\frac{l_{i,w}}{\sqrt{l_{i,x}^2 + l_{i,y}^2 + l_{i,z}^2}}\right)$
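A sketch of both bone features, assuming each joint $i$ has a single adjacent joint $i_{adj}$ (a parent in the kinematic tree); the parent list here is a made-up 5-joint chain, not the real NTU skeleton:

```python
import numpy as np

C, T, V = 3, 4, 5
x = np.random.randn(C, T, V)
parent = [0, 0, 1, 2, 3]  # hypothetical i_adj for each joint

# Bone vectors: l_i = x[:,:,i] - x[:,:,parent(i)]  (shape C, T, V)
l = x - x[:, :, parent]

# Bone angles: a_{i,w} = arccos(l_{i,w} / ||l_i||) per coordinate axis w.
# The epsilon guards the root joint, whose bone vector is zero.
norm = np.sqrt((l ** 2).sum(axis=0, keepdims=True)) + 1e-8
a = np.arccos(np.clip(l / norm, -1.0, 1.0))
```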
Attention Block
As shown in Fig. 3, pooling per body part → concat → one $W$ appears identical to the equation's concat → pooling → one $W$,
because the pooling is over the temporal axis and therefore independent per joint. Note that $W$ is shared across all parts, while a separate $W_{p}$ is used per part. (How the operation between $F_{in}W$ and $W_{p}$ is carried out is still unclear to me.)
Finally, each part's features are multiplied by its attention weight to produce the output.
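A heavily simplified sketch of the part-wise attention idea, not the authors' implementation: joints are grouped into $P$ body parts, each part is average-pooled, a shared $W$ transforms the pooled vector, a per-part $W_p$ produces a gate, and the gate rescales that part's features. Reducing the gate to a single scalar per part and using ReLU/sigmoid activations are my assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

C, T, V, P = 8, 10, 6, 3
parts = [[0, 1], [2, 3], [4, 5]]  # hypothetical joint-to-part grouping
x = np.random.randn(C, T, V)
W = np.random.randn(C, C)                    # shared across all parts
W_p = [np.random.randn(C) for _ in range(P)]  # one per part

out = x.copy()
for p, joint_ids in enumerate(parts):
    # Pool the part over time and its joints -> one (C,) vector.
    pooled = x[:, :, joint_ids].mean(axis=(1, 2))
    # Shared transform, then per-part projection to a scalar gate.
    gate = sigmoid(W_p[p] @ np.maximum(W @ pooled, 0))
    out[:, :, joint_ids] *= gate  # rescale the part's features
```

Because each gate lies in (0, 1), the block can only attenuate parts, which is what makes the learned weights usable as a per-part explanation of the classification.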
Attention Layer
An attention block is appended after each ST-GCN layer to form one layer of the network, and each layer also includes a residual connection (see Fig. 2).
Results