Single Image Depth Estimation With Integration of Parametric Learning and Non-Parametric Sampling

Hyungjoo Jung†, Kwanghoon Sohn††

ABSTRACT

Understanding the 3D structure of scenes is of great interest in various vision-related tasks. In this paper, we present a unified approach for estimating depth from a single monocular image. The key idea of our approach is to take advantage of both parametric learning and non-parametric sampling. Using a parametric convolutional network, our approach learns the relations among various monocular cues to make a coarse global prediction. We also leverage a local prediction, estimated in a non-parametric framework, to refine the global prediction. The integration of local and global predictions is accomplished by concatenating the feature maps of the global prediction with those of the local ones. Experimental results demonstrate that the proposed method outperforms state-of-the-art methods both qualitatively and quantitatively.

Key words: Depth Estimation, Convolutional Neural Network, Fully Convolutional Network, Non-Parametric Method

※ Corresponding Author : Kwanghoon Sohn, Address: C129, The 3rd Engineering Building, Yonsei University, 50 Yonsei-ro, Seodaemun-Gu, Seoul 120-749, Korea, TEL : +82-2-2123-2879, FAX : +82-2-2123-7712, E-mail : [email protected]

Receipt date : May 26, 2016, Revision date : July 17, 2016, Approval date : Aug. 5, 2016

† School of Electrical & Electronic Engineering, Yonsei University (E-mail: [email protected])

†† School of Electrical & Electronic Engineering, Yonsei University

※ This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIP) (NRF-2013R1A2A2A01068338)

1. INTRODUCTION

Understanding 3D structure from a single image is a central problem that is receiving increased attention because of its wide applicability in vision applications [31, 32]. The 3D structure of a scene provides decisive cues that have enabled major breakthroughs in several tasks, such as scene labeling [1], object detection [2], and intrinsic image estimation [3].

Humans have a remarkable ability to infer scene structure in monocular situations. This is because prior information for understanding 3D structure can be learned from various monocular depth cues, including motion, occlusion, and shading. However, replicating this mechanism with machines is not trivial: it is difficult to discover the relations among the numerous monocular cues underlying an image.

There have been many attempts to estimate depth information from single images. We roughly classify the existing methods into two groups: semi-automatic and automatic approaches. The former rely on human interaction for depth estimation. Conventional semi-automatic methods require highly labor-intensive procedures, which is why they can achieve good performance. In fact, many 3D films have been converted from 2D ones through the following steps: separating objects in individual frames, specifying proper depths, and correcting errors after final rendering. To reduce the burden of human interaction, scribble-based methods have been introduced [4, 5]. In [4], the predicted depth is obtained from sparse depth scribbles through optimization-based propagation. Non-local random walks [5] can reduce the number of user scribbles. However, scribble-based methods are not suitable for automatic vision-related tasks.

In contrast to semi-automatic approaches, automatic depth estimation methods do not require any user interaction. For example, they estimate depth automatically using depth cues such as shape from shading [6] and structure from motion [7]. However, their applications are limited to restricted environments, e.g., moving cameras or static scenes.

After the emergence of large-scale RGB-D datasets, data-driven approaches to depth estimation have received considerable attention. They can be classified into two categories: non-parametric sampling and parametric learning. Non-parametric sampling methods infer depth from retrieved depth candidates that are visually similar to the query image, under the assumption that visually similar scenes have similar geometric structures. Visually similar RGB-D candidates are selected using hand-crafted features, such as GIST [8] and HOG [9]. In [10] and [11], a warping process is performed to align the boundaries of the depth candidates with those of the query image.

In contrast to non-parametric sampling methods, parametric learning methods build a model that describes the relationship between a scene and its corresponding depth. In [12], a Markov Random Field (MRF) model is built to estimate a depth map from a single image. The model is trained on monocular cues and the relations among multiple regions in the input image. More recently, convolutional neural networks (CNNs) have succeeded in depth estimation from a single image. A multi-scale CNN framework for depth estimation is introduced in [13]. In addition, a framework that jointly trains continuous CRFs and deep CNNs [14] can estimate a plausible depth. Most recently, fully convolutional networks (FCNs) [15] have become mainstream for depth estimation.

In this paper, we integrate a non-parametric algorithm and CNNs for depth estimation from a single monocular scene. First, we estimate a coarse depth map using an FCN. A non-parametric sampling method then extracts warped depth maps whose boundaries are aligned with those of the query image. The warped depth maps are concatenated with the activations of a convolutional layer of the CNN for depth refinement, and the network outputs a plausible depth prediction.

This paper is organized as follows. Section 2 presents related work. The proposed algorithm is explained in Section 3. Section 4 presents the implementation details and experimental results. Finally, we conclude the paper with some discussion and future work in Section 5.

2. RELATED WORKS

Recently, several methods have been proposed to estimate depth directly from a single image. Among them, we review and discuss the two groups of methods that are most relevant to ours.

2.1 Non-parametric Sampling Methods

Non-parametric sampling methods have received considerable attention since the emergence of large RGB-D datasets. They retrieve scenes that are visually similar to the query image, under the assumption that visually similar scenes also have similar geometric structures, and estimate depth from the retrieved images. Depth Transfer [10] selects candidates from the RGB-D database using the GIST feature [8]. To align depth boundaries with those of the query image, warped depth maps are obtained by computing SIFT flow [16] between the query image and the RGB candidates. A global optimization with a robust potential function is then performed to regress the final depth from the warped depth maps. However, computing SIFT flow and solving the optimization problem are computationally very expensive. Choi et al. [11] adopt PatchMatch [17] for an efficient warping process. Their method, called Depth Analogy, estimates depth gradients rather than estimating the depth map directly, and the final depth prediction is obtained by Poisson reconstruction. This framework is reasonable because of the statistical invariance of depth gradients.

In contrast to Depth Transfer and Depth Analogy, Konrad et al. [18] propose a depth estimation method without a warping process. They assume that the locations of some objects (e.g., sky, buildings, furniture) are quite consistent with those in the candidate images. Based on this assumption, the inferred depth values are taken as the median of the retrieved depth maps. Joint filtering is then applied to align the depth boundaries with those of the input image. However, if the candidate depths are not locally consistent with the query image, the median may fail to yield proper depth values.
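As a rough illustration of the median-based fusion described above, the following NumPy sketch takes the per-pixel median over a stack of retrieved depth maps. The function name `median_depth` and the (K, H, W) array layout are assumptions made for this example, and the joint filtering step of [18] is not included.

```python
import numpy as np

def median_depth(candidate_depths):
    """Per-pixel median over K retrieved depth maps of shape (K, H, W)."""
    stack = np.asarray(candidate_depths, dtype=np.float64)
    return np.median(stack, axis=0)

# Toy example: three 2x2 candidate depth maps (values are illustrative only).
candidates = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[1.5, 2.5], [9.0, 4.0]],
    [[0.5, 8.0], [3.5, 4.0]],
]
fused = median_depth(candidates)
```

The median suppresses a single outlying candidate at a pixel (such as the value 9.0 above), but it cannot recover a correct depth when most candidates are locally inconsistent with the query, which is the failure case noted in the text.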

2.2 Parametric Learning Methods

Another group of researchers focuses on defining a parametric model between RGB and depth. Saxena et al. [12] propose a Markov Random Field (MRF) model to estimate a depth map from a single image. A set of plane parameters is trained on a large-scale RGB-D dataset to capture the relationship between RGB and depth.

CNN-based depth estimation methods have been studied extensively. Eigen et al. [13] build a multi-scale CNN to estimate depth directly. The network consists of two deep network stacks. The first stack performs a coarse depth prediction at low resolution. After bilinear interpolation, the coarse prediction is concatenated with the second stack, which refines the coarse depth map. Liu et al. [14] propose a deep convolutional neural network that estimates depth using continuous CRFs and deep CNNs. The CRF captures the relationships between superpixels in the input image. Their main contribution is the joint training of the CRF and the CNN.

More recently, fully convolutional networks (FCNs) [15] have become a popular choice for semantic segmentation. FCNs can generate effective features and be trained in an end-to-end manner. However, FCNs produce coarse output because the network simply resizes a low-resolution map to high resolution using bilinear interpolation. Chen et al. [19] propose a framework that refines the FCN output using a fully connected CRF for semantic segmentation. In [20], the fully connected CRF is replaced with a domain transform. As in semantic segmentation, FCN-based depth estimation methods concentrate on how to refine depth boundaries, using a CRF [21] or annotations of relative depth [22].

Our proposed method is also based on an FCN for depth estimation from a single monocular scene. We integrate a non-parametric sampling framework to refine the coarse FCN output. Because we embed the boundary information of the non-parametric results into a CNN for refinement, the final depth contains both global and local information.

3. PROPOSED METHOD

The proposed method consists of three components: a parametric model for coarse depth estimation, a non-parametric sampling framework that produces warped depth maps, and CNN layers for depth refinement. The overall framework of the proposed method is shown in Fig. 1.

3.1 Parametric Depth Estimation Model: Fully Convolutional Network

The first component of the proposed method is an FCN, which produces a coarse depth map. We modify the VGG-16 net [23] to build the FCN, as shown in Fig. 2. The upper layers in Fig. 2 come from the VGG-16 net; they consist of 10 convolutional, 10 ReLU, and 4 max-pooling layers. We use parameters pre-trained on ImageNet [24] as the initial values. We concatenate the last two features, which have 1/8 and 1/16 the scale of the input image. The features are resized to the input resolution using bilinear interpolation before concatenation. To obtain a better high-resolution prediction, we concatenate the middle activations with the last ones: since the middle activations undergo less pooling, they contain more structural information. Two convolutional layers are then applied to obtain the FCN output. We do not add a ReLU after the last convolutional layer. We use the loss of [13], defined as follows:

$$ L(D, D^*) = \frac{1}{N}\sum_{i}\left(D_i - D_i^*\right)^2 - \frac{1}{2N^2}\left(\sum_{i}\left(D_i - D_i^*\right)\right)^2 + \frac{1}{N}\sum_{i}\left[\left(\nabla_x (D - D^*)\right)_i^2 + \left(\nabla_y (D - D^*)\right)_i^2\right] \tag{1} $$

where $D$ and $D^*$ are the FCN output and its ground truth, respectively. In addition, $\nabla_x$ and $\nabla_y$ denote the horizontal and vertical difference matrices, and $N$ is the total number of valid pixels. The first and second terms are the least-squares and scale-invariant differences, respectively, and the last term encourages the output to have gradients similar to those of the ground truth. As mentioned in the previous sections, the FCN produces a coarse depth map. In our method, local structures at depth boundaries are refined with the output of the non-parametric sampling method.
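To make the three terms of the loss concrete, the following NumPy sketch evaluates a loss of this form on a single depth map. It is only an illustration under stated assumptions: the function name `depth_loss`, the weight `lam = 0.5` on the scale-invariant term, the use of forward differences for the gradient term, and the handling of the validity mask are choices made for this example, not details confirmed by the text.

```python
import numpy as np

def depth_loss(pred, gt, valid=None, lam=0.5):
    """Least-squares + scale-invariant + gradient loss on one (H, W) depth map.

    lam is an assumed weight for the scale-invariant term.
    """
    pred = np.asarray(pred, dtype=np.float64)
    gt = np.asarray(gt, dtype=np.float64)
    if valid is None:
        valid = np.ones(pred.shape, dtype=bool)
    n = valid.sum()

    # Per-pixel difference, zeroed where the ground truth is invalid.
    diff = np.where(valid, pred - gt, 0.0)

    least_squares = (diff ** 2).sum() / n
    scale_invariant = lam * diff.sum() ** 2 / n ** 2

    # Forward differences of the error map (assumed discretisation),
    # counted only where both neighbouring pixels are valid.
    gx = diff[:, 1:] - diff[:, :-1]
    gy = diff[1:, :] - diff[:-1, :]
    vx = valid[:, 1:] & valid[:, :-1]
    vy = valid[1:, :] & valid[:-1, :]
    gradient = ((gx ** 2) * vx).sum() / n + ((gy ** 2) * vy).sum() / n

    return least_squares - scale_invariant + gradient

# Toy check: a constant offset of 2 gives 4 - 0.5 * 4 + 0 = 2.
gt_toy = np.arange(12, dtype=np.float64).reshape(3, 4) + 1.0
loss_offset = depth_loss(gt_toy + 2.0, gt_toy)
```

Note how the scale-invariant term partially forgives a uniform offset (the loss is 2 rather than 4 in the toy check), while the gradient term is unaffected by such an offset and responds only to differences in local structure.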

3.2 Non-Parametric Sampling Method

The most important step in non-parametric sampling approaches is scene retrieval. In our method, we retrieve visually similar scenes using a CNN descriptor [25], which is a 4096-dimensional feature vector. We confirmed that the CNN descriptor outperforms hand-crafted features such as GIST and HOG in image retrieval. The CNN descriptors of the training set are pre-computed for efficient retrieval. Let $f_i$ denote the CNN descriptor of the image $I_i$ in the RGB-D database. Given the input image $I_q$ and its descriptor $f_q$, the visual similarity is measured by the sum of squared differences (SSD) as follows:

$$ S(I_q, I_i) = \left\| f_q - f_i \right\|_2^2 \tag{2} $$

Fig. 1. The overall framework of the proposed method.

Fig. 2. Fully Convolutional Network (FCN) in the proposed method.

In our experiments, we select 7 RGB-D candidates. Fig. 3 shows a query input and the 7 visually similar scenes retrieved with the CNN descriptor. We can confirm that the CNN descriptor retrieves visually similar scenes, which also have similar 3D structures. To align the depth boundaries with those of the input, we perform a warping process using PatchMatch [17], which is much faster than SIFT flow [16]. Although it is hard to estimate plausible depth maps with a non-parametric approach alone, the warping process yields warped depth maps that contain depth boundary information. The warped depth maps and the FCN output are used as the input to the depth refinement framework.
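The retrieval step can be sketched in a few lines of NumPy: compute the SSD between the query descriptor and every pre-computed database descriptor, then keep the k smallest. The function name `retrieve_candidates` and the toy 2-dimensional descriptors are assumptions for this example; in the paper the descriptors are 4096-dimensional and k = 7.

```python
import numpy as np

def retrieve_candidates(query_desc, db_descs, k=7):
    """Return indices and SSD scores of the k database descriptors closest to the query."""
    query_desc = np.asarray(query_desc, dtype=np.float64)   # shape (D,)
    db_descs = np.asarray(db_descs, dtype=np.float64)       # shape (M, D)
    ssd = ((db_descs - query_desc) ** 2).sum(axis=1)        # SSD against each database entry
    order = np.argsort(ssd, kind="stable")[:k]              # smallest SSD first
    return order, ssd[order]

# Toy database of four 2-D descriptors (illustrative values only).
db = [[0.0, 0.0], [1.0, 1.0], [3.0, 3.0], [0.5, 0.5]]
idx, scores = retrieve_candidates([0.0, 0.0], db, k=2)
```

Because the database descriptors are fixed, they can be computed once and stored, so each query costs one pass over the database; when training on the database itself, the query image's own index would need to be excluded from the result.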

3.3 Depth Refinement by CNNs

Traditional approaches refine depth using joint filtering techniques such as the joint bilateral filter [29] or the guided filter [30]. They usually use the color image as the guidance image. However, because of the structural inconsistency between RGB and depth, texture-copying artifacts frequently occur in the refined depth. Learning-based approaches now address this problem: rather than using the color image directly as guidance, they use CNN layers for depth refinement. To effectively regress the final depth from the FCN output and the warped depth maps, we build CNN layers as shown at the bottom of Fig. 1. We modify the residual network of [26]. Kim et al. [26] show that learning only the residual converges faster and achieves superior performance. In addition, it can alleviate the vanishing/exploding gradient problem in backpropagation. We confirmed that learning only the residual converges faster and produces better performance than a general deep network. To provide the local structures of depth to the network, we concatenate the 7 warped depths with the activations of the first convolutional layer. The desired output is then obtained by adding the residual to the FCN output. As with the FCN, Eq. (1) is used as the loss function.
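The data flow of the refinement stage can be sketched as follows. This is only a shape-level illustration: the learned layers are passed in as callables, and `refine_depth`, `first_conv`, and `residual_net` are hypothetical names. It also assumes that the first convolutional layer is applied to the FCN output, which the text suggests but does not state explicitly.

```python
import numpy as np

def refine_depth(fcn_depth, warped_depths, first_conv, residual_net):
    """Residual refinement data flow, with the learned layers given as callables.

    fcn_depth:     (H, W) coarse depth from the FCN
    warped_depths: (K, H, W) warped candidate depths (K = 7 in the paper)
    first_conv:    hypothetical stand-in mapping (1, H, W) -> (C, H, W)
    residual_net:  hypothetical stand-in mapping (C + K, H, W) -> (H, W)
    """
    fcn_depth = np.asarray(fcn_depth, dtype=np.float64)
    warped_depths = np.asarray(warped_depths, dtype=np.float64)

    feats = first_conv(fcn_depth[np.newaxis])                 # (C, H, W)
    stacked = np.concatenate([feats, warped_depths], axis=0)  # (C + K, H, W)
    residual = residual_net(stacked)                          # (H, W)

    # The network predicts only the residual; the coarse depth is added back.
    return fcn_depth + residual

# Toy stand-ins for the learned layers, used only to exercise the shapes.
H, W, K = 4, 5, 7
coarse = np.ones((H, W))
warped = np.zeros((K, H, W))
toy_first_conv = lambda x: np.repeat(x, 3, axis=0)   # 1 -> 3 channels
toy_residual_net = lambda s: s.mean(axis=0)          # collapse channels
refined = refine_depth(coarse, warped, toy_first_conv, toy_residual_net)
```

Because the coarse depth is added back at the end, the refinement layers only need to model the correction around depth boundaries, which is consistent with the faster convergence reported for residual learning.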

4. EXPERIMENTAL RESULTS

We evaluate the proposed method on the NYU Depth V2 dataset [27]. The dataset is composed of 1449 indoor scenes and corresponding depths captured by a Kinect sensor. We use 795 RGB-D pairs as the training set and 654 scenes as the test set. The training set is used to train the parameters of our deep neural networks and also serves as the RGB-D database for non-parametric sampling. When training the CNN layers for depth refinement, the 7 visually similar scenes are selected from the training set, excluding the query image itself. The proposed method is implemented on a standard desktop with an NVIDIA Titan GPU, using the popular CNN toolbox MatConvNet.

Fig. 3. (a) Input image, (b) visually similar scenes retrieved by the CNN descriptor.

For training our networks, the training images are resized from 640×480 to 320×240 because of GPU memory limitations. The convolutional layers of the FCN (upper layers in Fig. 2) are initialized from the pre-trained VGG-16 net, and the others are randomly initialized with zero-mean Gaussian values. All parameters of the CNN for depth refinement are also randomly initialized. The learning rate is $10^{-4}$ for the FCN. For the refinement CNN, we can use a large learning rate of $10^{-1}$ because of residual learning. The momentum is 0.9 and the batch size is 32 for all networks. In the refinement CNN, we add batch normalization [28] layers after all convolutional layers except the last one. Training deep neural networks is difficult because the distribution of each layer's inputs changes during training, which is called internal covariate shift and slows down training. Batch normalization addresses this problem by normalizing each layer's input.
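For readers unfamiliar with batch normalization, a minimal training-mode sketch in NumPy is given below. It normalizes each channel over the batch and spatial dimensions and then applies a learned scale and shift. The function name `batch_norm`, the (N, C, H, W) layout, and the value of `eps` are assumptions for this example; running statistics for test time are omitted.

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Training-mode batch normalization for activations of shape (N, C, H, W)."""
    x = np.asarray(x, dtype=np.float64)
    # Per-channel statistics over the batch and spatial dimensions.
    mean = x.mean(axis=(0, 2, 3), keepdims=True)
    var = x.var(axis=(0, 2, 3), keepdims=True)
    x_hat = (x - mean) / np.sqrt(var + eps)
    # Learned per-channel scale (gamma) and shift (beta).
    gamma = np.asarray(gamma, dtype=np.float64).reshape(1, -1, 1, 1)
    beta = np.asarray(beta, dtype=np.float64).reshape(1, -1, 1, 1)
    return gamma * x_hat + beta

# Toy activations with non-zero mean and non-unit variance (batch size 32 as in the text).
rng = np.random.default_rng(0)
acts = rng.normal(loc=3.0, scale=2.0, size=(32, 4, 6, 8))
normed = batch_norm(acts, gamma=np.ones(4), beta=np.zeros(4))
```

With `gamma = 1` and `beta = 0`, each output channel has approximately zero mean and unit variance regardless of the input statistics, which is what keeps the input distribution of the following layer stable during training.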

We compare our method with one non-parametric method [18] and two parametric methods [12, 13]. For the quantitative evaluation, we use several error measures that are standard in depth estimation. Let $D_i$ and $D_i^*$ denote the predicted depth and the ground truth at pixel $i$, respectively, and let $N$ be the total number of pixels in the test images. The measures are defined as follows:

average relative error (rel): $\frac{1}{N}\sum_{i}\frac{\left|D_i - D_i^*\right|}{D_i^*}$ (3)

root mean squared error (rms): $\sqrt{\frac{1}{N}\sum_{i}\left(D_i - D_i^*\right)^2}$ (4)

root mean squared error in log space (log): $\sqrt{\frac{1}{N}\sum_{i}\left(\log D_i - \log D_i^*\right)^2}$ (5)

accuracy with threshold (thr): percentage (%) of $D_i$ s.t. $\max\left(\frac{D_i}{D_i^*}, \frac{D_i^*}{D_i}\right) = \delta < thr$ (6)

We analyze the performance of the proposed method in comparison with the competing algorithms. Fig. 4 shows our results. As mentioned in the previous sections, the FCN output (Fig. 4(c)) captures global depth values, but its depth boundaries are very smooth. By aggregating the FCN output with the warped depth maps through the CNN, the depth boundaries become much sharper. Furthermore, small objects, such as the projector and the bookshelves in the background, are refined, as shown in Fig. 4(d). Because no refinement by joint filtering is performed, our results are more trustworthy. Fig. 5 shows a qualitative comparison between our method and the competing methods. In general, the non-parametric method performs poorly in estimating depth. The method of Eigen et al. [13] can infer a plausible depth, but its outputs are very smooth. In contrast, our method not only estimates global depth values but also preserves depth boundaries. Table 1 shows the quantitative evaluation with the measures above; it also shows that our method outperforms the non-parametric method.
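The error measures of Eqs. (3)-(6) can be sketched in NumPy as follows. The function name `depth_metrics` is an assumption for this example, the natural logarithm is assumed for the log-space error because the base is not stated in the text, and the accuracies are returned as fractions rather than percentages. The thresholds 1.25, 1.25^2, and 1.25^3 are the ones conventionally used for this measure.

```python
import numpy as np

def depth_metrics(pred, gt, thresholds=(1.25, 1.25 ** 2, 1.25 ** 3)):
    """rel, rms, log-space rms, and threshold accuracies over all pixels."""
    pred = np.asarray(pred, dtype=np.float64).ravel()
    gt = np.asarray(gt, dtype=np.float64).ravel()

    rel = np.mean(np.abs(pred - gt) / gt)
    rms = np.sqrt(np.mean((pred - gt) ** 2))
    # Natural log is an assumption; the text does not state the base.
    rms_log = np.sqrt(np.mean((np.log(pred) - np.log(gt)) ** 2))

    # delta = max(pred / gt, gt / pred); accuracy is the fraction below each threshold.
    delta = np.maximum(pred / gt, gt / pred)
    acc = [float(np.mean(delta < t)) for t in thresholds]

    return {"rel": float(rel), "rms": float(rms), "log": float(rms_log), "acc": acc}

# Toy example: one pixel is off by a factor of two, the other is exact.
metrics = depth_metrics([2.0, 4.0], [1.0, 4.0])
```

In the toy example the pixel that is off by a factor of two fails all three thresholds, so each accuracy is 0.5, while `rel` is also 0.5 because the relative error of that pixel is 1.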

Table 1. Quantitative evaluations with competing algorithms

Method               Error (lower is better)       Accuracy (higher is better)
                     rel      rms      log         δ<1.25    δ<1.25^2   δ<1.25^3
Make3D [12]          0.349    1.214    0.409       0.447     0.745      0.897
Depth Fusion [18]    0.312    1.191    0.398       0.581     0.771      0.851
Eigen et al. [13]    0.215    0.907    0.285       0.611     0.887      0.971
Ours (FCN)           0.204    0.894    0.279       0.654     0.841      0.932
Ours (Final)         0.194    0.811    0.211       0.712     0.887      0.984

Fig. 4. Qualitative evaluation: (a) input image, (b) ground truth, (c) FCN output, (d) ours.

5. CONCLUSION

This paper has presented a depth estimation method for a single monocular scene that combines a popular FCN with a non-parametric method. The FCN extracts globally plausible depth, whereas the outputs of the non-parametric method preserve local structures owing to the PatchMatch-based warping process. The two strengths are aggregated by a CNN that learns only the residual. Because we adopt recent techniques (i.e., batch normalization and residual learning) in training our deep networks, they converge quickly and produce better results. Experimental results show that the proposed method outperforms existing methods in estimating both the global and local structures of depth. In the experiments, we tried to use a large RGB-D dataset. It is well known that a large number of training images improves the performance of deep neural networks and helps to avoid overfitting. Non-parametric methods also perform better with a large database. However, a large database makes it harder to retrieve training images in the non-parametric method, so fast image search algorithms could improve the proposed method. In future work, we will extend the proposed method by jointly training the parametric and non-parametric methods.

Fig. 5. Comparison with competing algorithms: (a) input image, (b) ground truth, (c) Depth Fusion [18], (d) Make3D [12], (e) Eigen et al. [13], (f) ours.

REFERENCE

[ 1 ] X. Ren, L. Bo, and D. Fox, "RGB-(D) Scene Labeling: Features and Algorithms," Proceeding of IEEE Conference on Computer Vision and Pattern Recognition, pp. 2759-2766, 2012.

[ 2 ] D. Lin, S. Fidler, and R. Urtasun, "Holistic Scene Understanding for 3D Object Detection with RGBD Cameras," Proceeding of IEEE International Conference on Computer Vision, pp. 1417-1424, 2013.

[ 3 ] J. Jeon, S. Cho, X. Tong, and S. Lee, "Intrinsic Image Decomposition Using Structure-Texture Separation and Surface Normals," Proceeding of European Conference on Computer Vision, pp. 218-233, 2014.

[ 4 ] O. Wang, M. Lang, and M. Gross, "StereoBrush: Interactive 2D to 3D Conversion Using Discontinuous Warps," Proceedings of the 8th Eurographics Symposium on Sketch-based Interfaces and Modeling, pp. 47-54, 2011.

[ 5 ] H. Yuan, S. Wu, P. Cheng, P. An, and S. Bao, "Nonlocal Random Walks Algorithm for Semi-Automatic 2D-to-3D Image Conversion," IEEE Signal Processing Letters, Vol. 22, No. 3, pp. 371-374, 2015.

[ 6 ] J. Atick, P. Griffin, and N. Redlich, "Statistical Approach to Shape from Shading: Reconstruction of Three-Dimensional Face Surfaces from Single Two-Dimensional Images," Neural Computation, Vol. 8, No. 6, pp. 1321-1340, 1996.

[ 7 ] R. Szeliski and P. Torr, "Geometrically Constrained Structure from Motion: Points on Planes," 3D Structure from Multiple Images of Large-Scale Environments, Springer Berlin Heidelberg, pp. 171-186, 1998.

[ 8 ] A. Oliva and A. Torralba, "Modeling the Shape of the Scene: A Holistic Representation of the Spatial Envelope," International Journal of Computer Vision, Vol. 42, No. 3, pp. 145-175, 2001.

[ 9 ] N. Dalal and B. Triggs, "Histograms of Oriented Gradients for Human Detection," Proceeding of IEEE Computer Vision and Pattern Recognition, Vol. 1, pp. 886-893, 2005.

[10] K. Karsch, C. Liu, and S. Kang, "Depth Extraction from Video Using Non-parametric Sampling," Proceeding of European Conference on Computer Vision, pp. 775-788, 2012.

[11] S. Choi, D. Min, B. Ham, Y. Kim, C. Oh, and K. Sohn, "Depth Analogy: Data-Driven Approach for Single Image Depth Estimation Using Gradient Samples," IEEE Transaction on Image Processing, Vol. 24, No. 12, pp. 5953-5966, 2015.

[12] A. Saxena, M. Sun, and A. Ng, "Make3D: Learning 3D Scene Structure from a Single Still Image," IEEE Transaction on Pattern Analysis and Machine Intelligence, Vol. 31, No. 5, pp. 824-840, 2009.

[13] D. Eigen, C. Puhrsch, and R. Fergus, "Depth Map Prediction from a Single Image Using a Multi-Scale Deep Network," Advances in Neural Information Processing Systems, pp. 2366-2374, 2014.

[14] F. Liu, C. Shen, and G. Lin, "Deep Convolutional Neural Fields for Depth Estimation from a Single Image," Proceeding of IEEE Conference on Computer Vision and Pattern Recognition, pp. 5162-5170, 2015.

[15] J. Long, E. Shelhamer, and T. Darrell, "Fully Convolutional Networks for Semantic Segmentation," Proceeding of IEEE Conference on Computer Vision and Pattern Recognition, pp. 3431-3440, 2015.

[16] C. Liu, J. Yuen, and A. Torralba, "SIFT Flow: Dense Correspondence across Scenes and Its Applications," IEEE Transaction on Pattern Analysis and Machine Intelligence, Vol. 33, No. 5, pp. 978-994, 2011.

[17] C. Barnes, E. Shechtman, A. Finkelstein, and D. Goldman, "PatchMatch: A Randomized Correspondence Algorithm for Structural Image Editing," ACM Transactions on Graphics, Vol. 28, No. 3, pp. 24, 2009.

[18] J. Konrad, M. Wang, P. Ishwar, C. Wu, and D. Mukherjee, "Learning-based, Automatic 2D-to-3D Image and Video Conversion," IEEE Transaction on Image Processing, Vol. 22, No. 9, pp. 3485-3496, 2013.

[19] L.C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A.L. Yuille, "Semantic Image Segmentation with Deep Convolutional Nets and Fully Connected CRFs," arXiv: 1412.7062, 2014.

[20] L.C. Chen, J. Barron, and A.L. Yuille, "Semantic Image Segmentation with Task-Specific Edge Detection Using CNNs and a Discriminatively Trained Domain Transform," arXiv: 1511.03328, 2015.

[21] F. Liu, C. Shen, G. Lin, and I.D. Reid, "Learning Depth from Single Monocular Images Using Deep Convolutional Neural Fields," Proceeding of IEEE Conference on Computer Vision and Pattern Recognition, pp. 5162-5170, 2015.

[22] W. Chen, Z. Fu, D. Yang, and J. Deng, "Single-Image Depth Perception in the Wild," arXiv: 1604.03901, 2016.

[23] K. Simonyan and A. Zisserman, "Very Deep Convolutional Networks for Large-Scale Image Recognition," arXiv: 1409.1556, 2014.

[24] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A.C. Berg, and L. Fei-Fei, "ImageNet Large Scale Visual Recognition Challenge," International Journal of Computer Vision, Vol. 115, No. 3, pp. 211-252, 2015.

[25] A. Krizhevsky, I. Sutskever, and G. Hinton, "ImageNet Classification with Deep Convolutional Neural Networks," Advances in Neural Information Processing Systems, pp. 1097-1105, 2012.

[26] J. Kim, J. Lee, and K. Lee, "Accurate Image Super-Resolution Using Very Deep Convolutional Networks," arXiv: 1511.04587, 2015.

[27] N. Silberman, D. Hoiem, P. Kohli, and R. Fergus, "Indoor Segmentation and Support Inference from RGBD Images," Proceeding of European Conference on Computer Vision, pp. 746-760, 2012.

[28] S. Ioffe and C. Szegedy, "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift," Proceeding of International Conference on Machine Learning, pp. 448-456, 2015.

[29] J. Kopf, M.F. Cohen, D. Lischinski, and M. Uyttendaele, "Joint Bilateral Upsampling," ACM Transactions on Graphics, Vol. 26, No. 3, pp. 96, 2007.

[30] K. He, J. Sun, and X. Tang, "Guided Image Filtering," Proceeding of European Conference on Computer Vision, pp. 1-14, 2010.

[31] D. Lee and S. Kwon, "A Recognition Method for Moving Objects Using Depth and Color Information," Journal of Korea Multimedia Society, Vol. 19, No. 4, pp. 681-688, 2016.

[32] S. Kim and H. Kang, "Semantic Segmentation of Indoor Scenes Using Depth Superpixel," Journal of Korea Multimedia Society, Vol. 19, No. 3, pp. 531-538, 2016.
