MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications , classification(1)


Paper review 2021. 1. 15. 22:05
Task : Image classification, object detection

Dataset : ImageNet, COCO

Goal : Earlier attempts at lightweight models mostly focused on network size; the authors instead target speed (latency), building the network from depthwise separable convolutions.

Contribution:

- Uses the depthwise separable convolution (also used in Xception) to cut computation and parameter count substantially at the cost of a small drop in accuracy, showing, as the title says, that the model is practical for mobile vision applications.

Abstract

The authors present a class of efficient models for mobile and embedded devices and other vision applications.

The models are called MobileNets and are made lightweight by using depthwise separable convolutions.

They also introduce two hyperparameters that trade off latency against accuracy, so that the model size can be chosen to fit the constraints of the target application.

Performance is evaluated on ImageNet classification and on a wide range of applications: object detection, fine-grained classification, face attributes and large-scale geo-localization.

1. Introduction

Since AlexNet in 2012, CNNs have become ubiquitous in computer vision.

The general trend has been to build deeper and more complicated models to get higher accuracy. In the real world, however, many applications run under limited computation.

So this paper describes an efficient network architecture and two hyperparameters for building very small, low-latency models that can easily be matched to the requirements of mobile and embedded vision applications.

2. Prior Work

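Many approaches to lightweight models focus only on network size. This paper instead proposes a class of architectures that lets you build a small network matching given latency and size constraints. MobileNets primarily focus on optimizing latency, but they also yield small networks. (The focus is speed.)

The authors say the models can be used in a variety of applications.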

3. MobileNet Architecture

MobileNet is built on depthwise separable convolutions, a form of conv layer that factorizes a standard convolution into a depthwise convolution and a 1x1 convolution called a pointwise convolution.

"Unlike the ordinary convolution we are used to, a depthwise separable convolution is split into two layers."

This factorization drastically reduces both model size and computation.

The paper shows the structural difference in the following figures, so let's go through them one by one.

Standard Convolution Filters

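First, the standard convolution in (a) is the familiar conv structure.

Its output feature map and computational cost can be written as follows (stride 1 with padding assumed), where F is the input feature map, K the convolution kernel and G the output feature map.

output feature map

$$G_{k,l,n} = \sum_{i,j,m} K_{i,j,m,n} \cdot F_{k+i-1,\,l+j-1,\,m}$$

computational cost

$$D_K \cdot D_K \cdot M \cdot N \cdot D_F \cdot D_F$$

Here D_K is the kernel size, N the number of output channels, M the number of input channels, and D_F the spatial size of the feature map.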
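As a quick sanity check on the standard convolution cost D_K · D_K · M · N · D_F · D_F, here is a minimal Python sketch. The function names and the example layer (3x3 kernel, 512 input and output channels, 14x14 feature map) are my own illustrative choices, and stride 1 with padding is assumed so that the output keeps the spatial size D_F.

```python
def standard_conv_mult_adds(d_k: int, m: int, n: int, d_f: int) -> int:
    """Multiply-adds of a standard convolution: D_K * D_K * M * N * D_F * D_F."""
    return d_k * d_k * m * n * d_f * d_f


def standard_conv_params(d_k: int, m: int, n: int) -> int:
    """Weights of a standard convolution (bias ignored): D_K * D_K * M * N."""
    return d_k * d_k * m * n


# Hypothetical layer: 3x3 kernel, 512 -> 512 channels, 14x14 feature map.
print(standard_conv_mult_adds(3, 512, 512, 14))  # 462422016
print(standard_conv_params(3, 512, 512))         # 2359296
```

The cost is multiplicative in all of kernel size, input channels, output channels and feature map size, which is exactly the coupling the depthwise separable convolution breaks.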

Depthwise Convolutional Filters

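Next are the depthwise convolutional filters.

Unlike (a), each filter here has a single channel.

For (b) as well, the output feature map and computational cost can be written as follows, where K̂ is the depthwise kernel and Ĝ its output.

output feature map

$$\hat{G}_{k,l,m} = \sum_{i,j} \hat{K}_{i,j,m} \cdot F_{k+i-1,\,l+j-1,\,m}$$

computational cost

$$D_K \cdot D_K \cdot M \cdot D_F \cdot D_F$$

Each filter has a single channel, so the factor N of the standard convolution disappears from the cost.

Depthwise convolution is far cheaper than standard convolution, but it only filters each input channel separately and never combines them, so it cannot create new features. An additional layer is therefore needed: a pointwise (1x1) convolution that computes linear combinations of the depthwise outputs.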

Pointwise Convolutional Filters

pointwise convolutional filters

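As shown above, the pointwise convolutional filters are simply 1x1 convolutions.

So the computational cost of a depthwise separable convolution, made up of a depthwise layer plus a pointwise layer, can be written as follows.

$$D_K \cdot D_K \cdot M \cdot D_F \cdot D_F + M \cdot N \cdot D_F \cdot D_F$$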
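To see how the two terms of the depthwise separable cost compare, here is a small Python sketch. The function names and the example layer (D_K = 3, M = N = 512, D_F = 14) are assumptions chosen for illustration.

```python
def depthwise_mult_adds(d_k: int, m: int, d_f: int) -> int:
    """Depthwise term: D_K * D_K * M * D_F * D_F."""
    return d_k * d_k * m * d_f * d_f


def pointwise_mult_adds(m: int, n: int, d_f: int) -> int:
    """Pointwise (1x1) term: M * N * D_F * D_F."""
    return m * n * d_f * d_f


# Hypothetical layer: 3x3 depthwise kernel, 512 -> 512 channels, 14x14 map.
dw = depthwise_mult_adds(3, 512, 14)
pw = pointwise_mult_adds(512, 512, 14)
print(dw, pw, dw + pw)  # 903168 51380224 52283392
```

In this setting almost all of the cost sits in the pointwise term, so the 1x1 convolutions are where the remaining computation goes.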

To summarize the overall structure:

https://towardsdatascience.com/review-xception-with-depthwise-separable-convolution-better-than-inception-v3-image-dc967dd42568

The figure at the link above shows the whole operation. The input is split by channel, each channel goes through its own filter, and the pointwise conv then sets the number of output channels. (minimin2.tistory.com/42)
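The two steps can be sketched in plain Python on nested lists. This is only a toy forward pass under my own assumptions (stride 1, no padding, no bias, hypothetical function names), not the paper's implementation; it is meant to show that the depthwise step keeps the channel count and never mixes channels, while the pointwise step mixes them and sets the output channel count.

```python
def depthwise_conv(x, kernels):
    """Filter each channel with its own K x K kernel (stride 1, no padding).

    x: list of M channels, each an H x W list of lists.
    kernels: list of M kernels, each a K x K list of lists.
    Returns M output channels; channels are never mixed.
    """
    out = []
    for channel, kernel in zip(x, kernels):
        k = len(kernel)
        rows = len(channel) - k + 1
        cols = len(channel[0]) - k + 1
        out.append([
            [sum(kernel[i][j] * channel[r + i][c + j]
                 for i in range(k) for j in range(k))
             for c in range(cols)]
            for r in range(rows)
        ])
    return out


def pointwise_conv(x, weights):
    """1x1 convolution: each output channel is a weighted sum of input channels.

    x: list of M channels, each an H x W list of lists.
    weights: list of N rows, each holding M coefficients.
    """
    rows, cols = len(x[0]), len(x[0][0])
    return [
        [[sum(w * x[m][r][c] for m, w in enumerate(row_weights))
          for c in range(cols)]
         for r in range(rows)]
        for row_weights in weights
    ]


# Toy input: M = 2 channels of size 3x3, filled with 1s and 2s.
x = [[[1] * 3 for _ in range(3)], [[2] * 3 for _ in range(3)]]
kernels = [[[1] * 3 for _ in range(3)] for _ in range(2)]  # two 3x3 box filters
weights = [[1, 1], [1, -1], [0.5, 0]]                      # N = 3 output channels

dw = depthwise_conv(x, kernels)   # still 2 channels: [[[9]], [[18]]]
pw = pointwise_conv(dw, weights)  # now 3 channels: [[[27]], [[-9]], [[4.5]]]
print(len(dw), len(pw))           # 2 3
```

Because padding is left out for brevity, the 3x3 input shrinks to 1x1 here; the cost formulas in this post assume padding, so that the output keeps the size D_F.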

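Finally, comparing the cost against standard convolution:

$$\frac{D_K \cdot D_K \cdot M \cdot D_F \cdot D_F + M \cdot N \cdot D_F \cdot D_F}{D_K \cdot D_K \cdot M \cdot N \cdot D_F \cdot D_F} = \frac{1}{N} + \frac{1}{D_K^2}$$

The paper uses 3x3 depthwise separable convolutions, which it reports need 8 to 9 times less computation than standard convolutions.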
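The "8 to 9 times" figure can be checked from the ratio 1/N + 1/D_K². The sketch below is a minimal illustration; the function name and the output channel counts tried are my own choices.

```python
def cost_ratio(d_k: int, n: int) -> float:
    """Separable cost divided by standard cost: 1/N + 1/D_K^2."""
    return 1 / n + 1 / d_k ** 2


# Reduction factor for a 3x3 kernel at a few output channel counts.
for n in (128, 256, 512, 1024):
    print(n, round(1 / cost_ratio(3, n), 2))
# 128 8.41
# 256 8.69
# 512 8.84
# 1024 8.92
```

With D_K = 3 the ratio can never drop below 1/9, so the reduction approaches but never reaches 9x as N grows.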

The overall network architecture is as follows.

Every layer is followed by batch normalization and ReLU, except the final fully connected layer.
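Since the architecture table is not reproduced here, the following Python sketch rebuilds the layer list as I read it from Table 1 of the paper and counts parameters and multiply-adds. The channel and stride list, the "same" padding assumption and the function name are my assumptions, and batch-norm parameters are ignored, so treat the totals as approximate.

```python
# (input channels, output channels, stride of the depthwise conv) for the
# 13 depthwise separable blocks, as read from Table 1 of the paper (assumed).
BLOCKS = (
    [(32, 64, 1), (64, 128, 2), (128, 128, 1), (128, 256, 2),
     (256, 256, 1), (256, 512, 2)]
    + [(512, 512, 1)] * 5
    + [(512, 1024, 2), (1024, 1024, 1)]
)


def mobilenet_totals(resolution: int = 224, classes: int = 1000):
    """Return (parameters, multiply-adds) for the baseline network.

    Batch-norm parameters are ignored, and "same" padding is assumed so that
    each stride-2 layer halves the spatial size.
    """
    size = resolution // 2                   # first layer: 3x3 conv, stride 2
    params = 3 * 3 * 3 * 32                  # standard conv, 3 -> 32 channels
    mult_adds = params * size * size
    for c_in, c_out, stride in BLOCKS:
        size //= stride
        dw, pw = 3 * 3 * c_in, c_in * c_out  # depthwise / pointwise weights
        params += dw + pw
        mult_adds += (dw + pw) * size * size
    params += 1024 * classes + classes       # fully connected layer with bias
    mult_adds += 1024 * classes
    return params, mult_adds


params, mult_adds = mobilenet_totals()
print(f"{params / 1e6:.1f}M parameters, {mult_adds / 1e6:.0f}M mult-adds")
# 4.2M parameters, 569M mult-adds
```

Counting the first convolution, the 13 depthwise and 13 pointwise layers, and the fully connected layer gives 28 layers in total.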

Next, the paper describes two parameters that shrink the model further.

Width Multiplier : Thinner Models

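The width multiplier is denoted α, with α ∈ (0, 1]: α = 1 is the baseline MobileNet and α < 1 gives reduced MobileNets. As the name "thinner models" suggests, α multiplies the number of input and output channels of every layer (M becomes αM, N becomes αN), which shrinks the model; computation and parameter count fall roughly by α².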
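A small Python sketch of the width multiplier's effect on one layer. The function name, the rounding of scaled channel counts to integers and the example layer (D_K = 3, M = N = 512, D_F = 14) are my own assumptions.

```python
def separable_mult_adds(d_k: int, m: int, n: int, d_f: int,
                        alpha: float = 1.0) -> int:
    """Depthwise separable cost with channels scaled by alpha (rounded to ints)."""
    m, n = round(alpha * m), round(alpha * n)
    return d_k * d_k * m * d_f * d_f + m * n * d_f * d_f


base = separable_mult_adds(3, 512, 512, 14)
for alpha in (1.0, 0.75, 0.5, 0.25):
    ratio = separable_mult_adds(3, 512, 512, 14, alpha) / base
    # The measured ratio stays close to alpha squared.
    print(alpha, round(ratio, 3), alpha ** 2)
```

The ratio is not exactly α² because the depthwise term scales only linearly with α, but the pointwise term dominates, so the quadratic approximation is close.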

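Resolution Multiplier : Reduced Representation

The second parameter, the resolution multiplier, is denoted ρ and reduces computational cost by lowering the resolution of the input image and, with it, of every internal feature map. Like α, its range is ρ ∈ (0, 1]. With both parameters applied, the computational cost of a depthwise separable layer is:

$$D_K \cdot D_K \cdot \alpha M \cdot \rho D_F \cdot \rho D_F + \alpha M \cdot \alpha N \cdot \rho D_F \cdot \rho D_F$$

In the paper's example, compared with a standard convolution, the depthwise separable conv and the versions with α and ρ applied on top of it show a considerable reduction in computation and parameters. (ρ itself lowers only the computation; it leaves the parameter count unchanged.)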
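The comparison can be recomputed with a short Python sketch. I am assuming the layer setting D_K = 3, M = N = 512, D_F = 14 with α = 0.75 and ρ = 0.714 for the paper's example; the function names and the rounding of scaled sizes to integers are mine.

```python
def standard_cost(d_k, m, n, d_f):
    """(multiply-adds, parameters) of a standard convolution."""
    return d_k * d_k * m * n * d_f * d_f, d_k * d_k * m * n


def separable_cost(d_k, m, n, d_f, alpha=1.0, rho=1.0):
    """(multiply-adds, parameters) of a depthwise separable convolution
    with width multiplier alpha and resolution multiplier rho."""
    m, n, d_f = round(alpha * m), round(alpha * n), round(rho * d_f)
    mult_adds = d_k * d_k * m * d_f * d_f + m * n * d_f * d_f
    params = d_k * d_k * m + m * n  # independent of rho
    return mult_adds, params


rows = {
    "standard conv": standard_cost(3, 512, 512, 14),
    "depthwise separable": separable_cost(3, 512, 512, 14),
    "alpha = 0.75": separable_cost(3, 512, 512, 14, alpha=0.75),
    "alpha = 0.75, rho = 0.714": separable_cost(3, 512, 512, 14, 0.75, 0.714),
}
for name, (mult_adds, params) in rows.items():
    print(f"{name:26s} {mult_adds / 1e6:6.1f}M mult-adds {params / 1e6:5.2f}M params")
```

This gives roughly 462M, 52.3M, 29.6M and 15.1M multiply-adds, with 2.36M, 0.27M, 0.15M and 0.15M parameters. The last two rows share the same parameter count, since lowering the resolution changes only how many positions each filter is applied to.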

4. Experiments

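This section verifies the effect of depthwise separable convolutions: accuracy is similar or slightly lower, while the number of parameters is greatly reduced.

First, the results for the parameters that control model size:

Unsurprisingly, 1.0 MobileNet-224 (α=1, ρ=1) has the highest accuracy among the MobileNet variants, and Table 4 shows that, compared with the same model built from standard convolutions, the parameter count drops sharply while accuracy differs by only about 1%.

Next is the comparison with VGG16 and GoogleNet on the ImageNet benchmark.

Accuracy does not differ much while the parameter count is far smaller, showing that the model can be made lightweight while keeping reasonable performance.

Beyond these examples, the paper verifies that MobileNets also apply to various applications such as object detection and face attributes, with solid performance.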

5. Conclusion

The paper proposed MobileNet, an architecture based on depthwise separable convolutions. It yields a small, fast model, and its two parameters adjust the model size to reach a reasonable point in the latency/accuracy trade-off.

As a result, it presents a model that removes a very large number of parameters while keeping solid performance.

Reference

arxiv.org/pdf/1704.04861.pdf

towardsdatascience.com/review-xception-with-depthwise-separable-convolution-better-than-inception-v3-image-dc967dd42568

minimin2.tistory.com/42