Automatic Classification of Drone Images Using Deep Learning and SVM with Multiple Grid Sizes




Kim, Sun Woong 1) · Kang, Min Soo 2) · Song, Junyoung 3) · Park, Wan Yong 4) · Eo, Yang Dam 5) · Pyeon, Mu Wook 6)

Abstract

SVM (Support vector machine) analysis was performed after applying a deep learning technique based on an Inception-based model (GoogLeNet). The accuracy of automatic image classification was analyzed using an SVM with multiple virtual grid sizes. Six classes were selected from a standard land cover map. Cars were added as a separate item to increase the classification accuracy of roads. The virtual grid size was 2–5 m for natural areas, 5–10 m for traffic areas, and 10–15 m for building areas, based on the size of items and the resolution of input images. The results demonstrate that automatic classification accuracy can be increased by adopting an integrated approach that utilizes weighted virtual grid sizes for different classes.

Keywords: Deep Learning, Support Vector Machine, Drone Image, Automatic Image Classification

1. Introduction

Deep learning requires a large amount of data to accurately perform feature extraction. Even though geospatial images typically contain a wide variety of features and object information, the volume of readily available input data is often insufficient for effective deep learning. The advantage of deep-learning-based image classification is that once learning has been performed for an initial dataset, a model can be retrained on additional datasets, and classification accuracy can be further improved using what is learned from each additional dataset (Bengio, 2012). Recent advancements in the production of high-resolution satellite images, aerial images, and drone-based images have provided new opportunities for deep-learning-based image classification (Song, 2019).

In particular, drone images can be easily acquired at low cost, in contrast with aerial and satellite images. Drone images are expected to become the main source of data for geospatial information products (Jo, 2017), and deep learning techniques may become the main classification tool for these images.

Ham (2019) investigated deep-learning-based classification for drone images. Building object features were extracted from pixel-based data to detect unregistered buildings. In addition, a deep learning model was trained using orthorectified images, true orthoimages, and digital maps published by the National Geographic Information Institute and Spatial Information Industry Promotion Institute. The performance of the model was compared using different datasets. Joo (2017) used a deep learning CNN (convolutional neural network) to classify crop species from drone images of agricultural fields. The fields were labeled using seven classes (rice, sweet potato, chili pepper, corn, sesame, fruit, and vinyl house). Labeling was carried out manually, whereas image processing and normalization were performed automatically. Results showed a classification accuracy of 98%. Kim et al. (2019) used Tiny-YOLO-V2 and Faster R-CNN to detect and analyze cracks on the surface of paved roads. Results showed that the Faster R-CNN algorithm was more effective at this task than the YOLO algorithm.

Received 2020. 08. 29, Revised 2020. 09. 16, Accepted 2020. 10. 13

1) Member, Researcher, Department of Advanced Technology Fusion, Konkuk University, Seoul, Korea (E-mail: [email protected])
2) Director, Jigusoft Inc, Seongnam, Korea (E-mail: [email protected])
3) Member, Undergraduate student, Department of Civil and Environmental Engineering, Konkuk University, Seoul, Korea (E-mail: songjy95@konkuk.ac.kr)
4) Member, Senior Principal Researcher, Agency for Defense Development, Korea (E-mail: [email protected])
5) Member, Professor, Department of Civil and Environmental Engineering, Konkuk University, Seoul, Korea (E-mail: [email protected])
6) Corresponding Author, Member, Professor, Department of Civil and Environmental Engineering, Konkuk University, Seoul, Korea (E-mail: neptune@konkuk.ac.kr)

https://doi.org/10.7848/ksgpc.2020.38.5.407
Original article

This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0) which permits unrestricted non-commercial use, distribution, and reproduction in any

To support the creation of land cover maps for the Ministry of Environment, this paper proposes an efficient process that extracts features with a CNN and determines the classes of drone images using an SVM (support vector machine).

Because the land cover classification system of the Ministry of Environment incorporates land-use attributes (an artificial classification based on human activities), methods that classify using varied features, such as deep learning, are expected to produce better results than basic classification methods based on pixel value distributions or textures.

An Inception model was trained with a limited number of patches obtained from drone images of an experimental area. That is, no external training was used (such as a pretrained model or image patches from another area). Three categories and six classes were selected, namely, traffic areas (road and car classes), building areas (villa and building classes), and natural field areas (natural and man-made vegetation).

For these six classes, Inception-based deep learning was applied and an SVM classifier was used with multiple virtual grids. Additionally, a method was proposed for integrating the results obtained for different virtual grid sizes and analyzing the expected classification probability distribution.

2. Proposed Methodology

We propose a method of extracting objects from images using an Inception model developed using deep learning and assigning each extracted object to the class with the highest probability by utilizing SVM classifiers. This method combines the advantages of deep learning and machine learning. In addition, external elements other than elements implemented by CNNs can be applied to the classification.

Tang (2013) used an integrated approach to increase classification accuracy by applying machine learning either as a preprocessing step before deep learning or as a postprocessing step after deep learning.

Petrovska et al. (2020) used pretrained CNNs to extract the deep features of aerial images from several CNN architectures. Then, they used an SVM for classifying the concatenated features. Their results were comparable with the results of other state-of-the-art methods. Notley & Magdon-Ismail (2018) extracted features from images and numeric data and used them as inputs for SVMs and KNN (k-nearest neighbor) classifiers to determine if neural-network-extracted features enhanced the capabilities of these models. Their study explores the idea of replacing the typical softmax classifier in a neural network with an SVM or a KNN classifier. The results of their study indicate that combining the features derived from neural networks with alternate classification models can provide high classification accuracy.

In a CNN, the performance of a model increases as the network becomes deeper with an increase in the number of layers. However, in such cases, the number of parameters for the learning process and the computation cost increase significantly. Further, if a model is trained with only a limited dataset, overfitting and vanishing gradient problems may occur owing to the nature of a deep network. To overcome these issues, Google proposed a model, named GoogLeNet, based on the concept of Inception (Szegedy et al., 2015).

The Inception-based model is paired with Faster R-CNN as the base algorithm (Ren et al., 2015); this algorithm has high memory and GPU requirements. The GoogLeNet model itself is 22 layers deep and is built by repeating Inception modules multiple times (Szegedy et al., 2015).

Faster R-CNN builds on Fast R-CNN, an extension of the R-CNN architecture that addresses the inefficiency of R-CNN in three ways. First, candidate regions are held in memory as region coordinates rather than being stored on disk as image files. Because coordinates in the (x, y, w, h) format require far less storage than image data, they fit in memory and allow fast calculation. Second, features are extracted by applying a CNN once to the entire image rather than to every candidate region. The coordinates of all candidate regions are adjusted according to the image size, which decreases as the input passes through the CNN. This significantly reduces the number of CNN operations. Third, a pooling technique that can process features of arbitrary sizes is used without warping. Performance is further improved by training the classifier and the regressor simultaneously instead of separately (Girshick, 2015).
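The coordinate adjustment and the warping-free pooling described above can be illustrated with a minimal pure-Python sketch. It is not the ERDAS or Faster R-CNN implementation: the function names `project_box` and `roi_max_pool`, the stride of 8, and the toy 4 x 4 feature map are all assumptions chosen for illustration.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x, y, w, h)


def project_box(box: Box, stride: int) -> Box:
    """Scale an image-space (x, y, w, h) box onto a feature map downsampled by `stride`."""
    x, y, w, h = box
    fx, fy = x // stride, y // stride
    # Ceiling division keeps at least one feature cell for small regions.
    fw = max(1, -(-w // stride))
    fh = max(1, -(-h // stride))
    return fx, fy, fw, fh


def roi_max_pool(feature: List[List[float]], box: Box, out_size: int = 2) -> List[List[float]]:
    """Max-pool a region of arbitrary size into an out_size x out_size grid (no warping)."""
    x, y, w, h = box
    pooled = []
    for i in range(out_size):
        # Row range covered by the i-th output bin (always at least one row).
        r0 = y + (i * h) // out_size
        r1 = max(r0 + 1, y + ((i + 1) * h) // out_size)
        row = []
        for j in range(out_size):
            c0 = x + (j * w) // out_size
            c1 = max(c0 + 1, x + ((j + 1) * w) // out_size)
            row.append(max(feature[r][c] for r in range(r0, r1) for c in range(c0, c1)))
        pooled.append(row)
    return pooled


# Toy 4 x 4 feature map (hypothetical values) and one candidate region in image pixels.
feature_map = [[4.0 * r + c for c in range(4)] for r in range(4)]
region = project_box((8, 8, 24, 24), stride=8)
pooled = roi_max_pool(feature_map, region, out_size=2)
```

Because every region is reduced to the same fixed output size, the classifier that follows can accept candidate regions of any shape while the CNN itself runs only once per image.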

Fig. 1. Spatial model developed using ERDAS spatial modeler

Fig. 1 illustrates the experimental model implemented using ERDAS IMAGINE.

The “Training Image Chips Location” stage involves generating a library for self-learning. The library comprises 60 images for each of the six classes. This generated library is sent to the Inception stage. The existing Inception model is trained with a GoogLeNet library to construct a system. Therefore, in the Inception stage, a new learning dataset is generated using the library received from the preceding stage. The number of training steps is 1,000, the learning rate is 0.01, and the validation parameter is 10. The newly trained Inception model is obtained in the “Machine Intellect Output” stage. Then, the model is trained with the six newly defined classes. Additionally, the “Retrained Machine Intellect” stage is employed for comparison and verification with initialized data. In general, class allocation or labeling is performed by inputting the shapes of the features to be extracted. However, for the purpose of this study, three sizes of a virtual grid are generated and used for quantitative comparison with thematic maps. Subsequently, the machine-learning-based classification step is designed to select the most suitable grid size for each class and to determine the probability that each pixel is correctly classified, according to the selected size. ERDAS is used to configure the SVM.

Compared to conventional neural network methods that minimize empirical errors, the SVM method attempts to minimize generalization errors by automatically tuning a few relevant parameters. The SVM method involves converting the data to a higher dimensional space to find the hyperplane that maximizes the distance between classes. The data points of each class that lie closest to the hyperplane are referred to as support vectors. The distance between a support vector and the hyperplane is referred to as the margin. The fundamental principle of the SVM model involves locating the hyperplane where the margin is maximized (Kang et al., 2012).
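To make the notions of margin and support vector concrete, the following minimal sketch evaluates two candidate hyperplanes on a toy two-class dataset. It does not train an SVM; it only measures the margin that a given hyperplane w·x + b = 0 achieves. The data points, the hyperplanes, and the helper names `signed_distance` and `margin_and_support` are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def signed_distance(w: Sequence[float], b: float, x: Sequence[float]) -> float:
    """Signed distance from point x to the hyperplane w.x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm


def margin_and_support(w: Sequence[float], b: float,
                       points: List[Sequence[float]], labels: List[int],
                       tol: float = 1e-9) -> Tuple[float, List[int]]:
    """Return the margin of the hyperplane and the indices of the closest points.

    Labels are +1 or -1. A negative margin means at least one point is misclassified.
    """
    dists = [y * signed_distance(w, b, x) for x, y in zip(points, labels)]
    margin = min(dists)
    support = [i for i, d in enumerate(dists) if abs(d - margin) <= tol]
    return margin, support


# Hypothetical, linearly separable toy data.
points = [(1.0, 0.0), (0.0, 3.0), (3.0, 1.0), (5.0, 2.0)]
labels = [-1, -1, 1, 1]

# Two candidate separating hyperplanes: x1 = 2 and x1 = 1.5.
margin_a, support_a = margin_and_support((1.0, 0.0), -2.0, points, labels)
margin_b, support_b = margin_and_support((1.0, 0.0), -1.5, points, labels)
```

Both hyperplanes separate the classes, but the first leaves a larger margin, so it is the one an SVM would prefer; the points returned in `support_a` play the role of support vectors for that hyperplane.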

3. Experimental Data and Training

3.1 Experimental location

The area around Shingu College in Seongnam-si, Gyeonggi-do, was selected as the experimental location because drone images are available for this area and it is characterized by an even distribution of residential, commercial, traffic, man-made grass field, and natural grass field areas. The images used in the experiment were obtained in 2017 at a ground sampling distance of 5 cm by a drone equipped with a senseFly camera sensor. A WPS (web processing service) model was constructed using a spatial modeler in ERDAS IMAGINE. The model automatically processes and classifies drone images by establishing the processes of self-learning and automated image classification.

The level-2 land cover map published by the Korean Ministry of Environment in 2010 was used for selecting object classification items and comparing results. Fig. 2 displays the level-2 land cover map of the experimental area.

Fig. 2. Experimental area: (a) land cover map, (b) training data location (110: Residential Area, 130: Commercial Area, 140: Amusement Area, 150: Road, 160: Public Area, 410: Natural Vegetation, 420: Man-made Vegetation, 620: Bare Land)

The experiment was conducted according to the flowchart shown in Fig. 1, and the sizes of the virtual grid were configured to be similar to the average size of each class of self-learning data. The sizes of the classification grid were recommended by the ERDAS autogrid function as 10–15 m each for villas and buildings (residential areas), 5–10 m each for roads and cars (traffic areas), and 2–5 m each for natural and man-made vegetation (grass field areas).

These grid sizes were recommended based on the spatial size and quantity of learning data and the spatial resolution of classification images. In this study, the grid size was 13.1 m for the villa and building classes, 6.55 m for the car and road classes, and 4.36 m for the natural and man-made vegetation classes.
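The relationship between the three grid sizes can be checked with a short sketch. The paper does not state how the sizes were derived; the code assumes, as a hypothesis consistent with the numbers, that the two finer grids subdivide each 13.1 m cell edge into two and three parts (13.1 / 3 ≈ 4.37 m, reported as 4.36 m). The cell counts 1,295, 5,180, and 11,655 are the totals reported in the error matrices; the dictionary and function names are illustrative.

```python
# Recommended ranges (m) and the sizes actually used, as stated in the text.
RECOMMENDED_RANGE_M = {
    "building": (10.0, 15.0),
    "traffic": (5.0, 10.0),
    "natural": (2.0, 5.0),
}
USED_SIZE_M = {"building": 13.1, "traffic": 6.55, "natural": 4.36}

BASE_SIZE_M = 13.1   # coarsest grid
BASE_CELLS = 1295    # number of 13.1 m cells reported for the study area


def subdivided_size(base_m: float, divisor: int) -> float:
    """Edge length when each base cell edge is split into `divisor` equal parts."""
    return base_m / divisor


def cell_count(base_count: int, divisor: int) -> int:
    """Each base cell becomes divisor**2 smaller cells."""
    return base_count * divisor ** 2


def within_recommended(category: str) -> bool:
    """Check that the size used for a category lies inside its recommended range."""
    low, high = RECOMMENDED_RANGE_M[category]
    return low <= USED_SIZE_M[category] <= high


sizes = {d: subdivided_size(BASE_SIZE_M, d) for d in (1, 2, 3)}
counts = {d: cell_count(BASE_CELLS, d) for d in (1, 2, 3)}
```

Under this assumption the cell counts match the reported totals exactly, which is why the same study area yields four and nine times as many cells at 6.55 m and 4.36 m.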

Classification accuracy was analyzed based on the classification results and the classification probability for each grid. The level-2 land cover map provided by the Korean Ministry of Environment was used as reference data.

3.2 Training dataset

The items for the classification of images were selected based on the level-2 land cover map. The model was trained with the abovementioned six classes to clearly determine the classification performance of the target area among all layers. Further, the model was additionally trained on the car class to increase the accuracy of classifying roads.

The selection and amount of training data are very important in both conventional and deep-learning-based classification models. The advantage of deep-learning-based classification is that it requires less training data compared to machine learning. However, in this study, training data were selected based on the locations that revealed the characteristics of classification items in detail to verify the feasibility of using only self-learning data. Accordingly, approximately 50 drone images were extracted for each class and used as training data. Various shapes and structures were applied as features in the case of man-made vegetation, and broad-leaved and coniferous trees were evenly applied as features in the case of natural vegetation.

Fig. 3. Drone images used as training data: (a) traffic area, (b) natural field area, (c) building area

4. Results and Discussion

4.1 Classification probability according to grid size for each class

The distribution of classification probability was analyzed according to the grid size allocated to each class. Fig. 4 illustrates the classification results for a grid size of 13.1 m.

Fig. 4(a) depicts the superimposition of the classification results on the land cover map, and Fig. 4(b) depicts classification probability. Here, classification probability refers to the predicted probability values derived from the classification based on the machine learning model.
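As a rough illustration of how a per-cell classification probability can be obtained and summarized, the sketch below assigns each grid cell to its most probable class and then counts how many cells fall in a middle probability band or above a high-confidence threshold. The probabilities are invented for illustration and are not data from the experiment; `classify_cell` and `summarize` are hypothetical helper names, and the thresholds mirror the 0.3–0.8 and 0.9 values used in the discussion.

```python
from typing import Dict, List, Tuple


def classify_cell(probs: Dict[str, float]) -> Tuple[str, float]:
    """Return the most probable class and its probability for one grid cell."""
    label = max(probs, key=lambda name: probs[name])
    return label, probs[label]


def summarize(results: List[Tuple[str, float]], low: float = 0.3,
              high: float = 0.8, confident: float = 0.9) -> Dict[str, float]:
    """Share of cells with probability in [low, high] and count of cells >= confident."""
    in_band = sum(1 for _, p in results if low <= p <= high)
    n_confident = sum(1 for _, p in results if p >= confident)
    return {"share_in_band": in_band / len(results), "n_confident": n_confident}


# Hypothetical class probabilities for three grid cells (not data from the paper).
cells = [
    {"road": 0.70, "car": 0.05, "villa": 0.10, "building": 0.05,
     "natural vegetation": 0.05, "man-made vegetation": 0.05},
    {"road": 0.02, "car": 0.02, "villa": 0.02, "building": 0.01,
     "natural vegetation": 0.92, "man-made vegetation": 0.01},
    {"road": 0.15, "car": 0.06, "villa": 0.30, "building": 0.39,
     "natural vegetation": 0.05, "man-made vegetation": 0.05},
]

results = [classify_cell(c) for c in cells]
summary = summarize(results)
```

The third cell shows why a low classification probability is a warning sign: the winning class barely beats the runner-up, so the assignment is easily flipped.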

Fig. 4. Classification results for a grid size of 13.1 m: (a) classification results, (b) classification probability

The classification probability for 84% of 1,295 grids was 0.3–0.8, and it was more than 0.9 for 39 grids. The average classification probability of the grids classified as natural field areas was the highest, at more than 0.6. This indicated that the results did not match the recommended training image size. In particular, the average classification probability of the grids classified as buildings was the lowest, at 0.39. This is expected to adversely affect the classification accuracy of buildings.

Fig. 5. Classification results for a grid size of 6.55 m: (a) classification results, (b) classification probability

Fig. 5 shows the classification results for a grid size of 6.55 m. This grid size yielded the highest classification probability. Classification probability was evenly distributed among all classes, and an average classification probability of 0.7 or more was observed in all classes except the building class. In particular, out of 5,180 grids, the classification probability for 2,284 grids was 0.9 or higher. In addition, the classification probability for natural field areas was the highest, regardless of the recommended grid size. Further, high classification probability was observed in traffic areas, as expected.

Fig. 6. Classification results for a grid size of 4.36 m: (a) classification results, (b) classification probability

Fig. 6 illustrates the classification results for a grid size of 4.36 m. Even though this was the smallest grid size, its classification probability was higher than that for a grid size of 13.1 m. This was because classification probability inevitably decreased as more classes were combined within a cell as the grid size increased; smaller cells, which are more likely to contain a single class, therefore tended to have higher classification probability. However, the overall classification probability at a grid size of 4.36 m was lower than that at a grid size of 6.55 m. This was because the class attributes of the building and traffic areas were similar at a grid size of 4.36 m, so there were several cases where a building area was misclassified as a traffic area.

4.2 Classification accuracy for different grid sizes

To conduct a one-to-one quantitative comparison, the land cover map was reclassified into three categories—traffic area, natural field area, and building area—in accordance with the grid classification criteria, as summarized in Table 1.

Table 1. Mapping between classification categories and land cover map codes

| Land Cover Map | Accuracy Evaluation Category | Classification Item |
|---|---|---|
| Man-made Vegetation (420) | Natural Field Area | Grass |
| Natural Vegetation (410) | Natural Field Area | Forest |
| Road (150) | Traffic Area | Road |
| Bare Land (620) | Traffic Area | Car |
| Residential Area (110) | Building Area | Villa |
| Commercial Area (130) | Building Area | Villa |
| Public Area (160) | Building Area | Building |

Fig. 7. Grid classification map (6.55 m): (a) grid classification map, (b) land cover map

Table 2. Classification error matrix for grid size of 13.1 m (rows: Deep, the deep learning classification; columns: Land_use, the land cover map)

| Deep \ Land_use | Building Area | Traffic Area | Natural Field Area | Sum | User’s |
|---|---|---|---|---|---|
| Building Area | 325 | 121 | 33 | 479 | 67.849% |
| Traffic Area | 45 | 227 | 14 | 286 | 79.370% |
| Natural Field Area | 17 | 252 | 261 | 530 | 49.245% |
| Sum | 387 | 600 | 308 | 1,295 | |
| Producer’s | 83.979% | 37.833% | 84.740% | | 62.780% |

Table 3. Classification error matrix for grid size of 6.55 m (rows: Deep; columns: Land_use)

| Deep \ Land_use | Building Area | Traffic Area | Natural Field Area | Sum | User’s |
|---|---|---|---|---|---|
| Building Area | 723 | 180 | 52 | 955 | 75.706% |
| Traffic Area | 686 | 1,386 | 94 | 2,166 | 63.988% |
| Natural Field Area | 149 | 831 | 1,079 | 2,059 | 52.404% |
| Sum | 1,558 | 2,397 | 1,225 | 5,180 | |
| Producer’s | 46.406% | 57.822% | 88.082% | | 61.544% |

Table 4. Classification error matrix for grid size of 4.36 m (rows: Deep; columns: Land_use)

| Deep \ Land_use | Building Area | Traffic Area | Natural Field Area | Sum | User’s |
|---|---|---|---|---|---|
| Building Area | 404 | 117 | 40 | 561 | 72.014% |
| Traffic Area | 2,933 | 3,823 | 392 | 7,148 | 53.483% |
| Natural Field Area | 169 | 1,432 | 2,345 | 3,946 | 59.427% |
| Sum | 3,506 | 5,372 | 2,777 | 11,655 | |
| Producer’s | 11.523% | 71.165% | 84.443% | | 56.388% |

Even though we expected a grid size of 6.55 m to yield the highest classification accuracy, the classification accuracy for a grid size of 13.1 m was better than that for a grid size of 6.55 m, as shown in Tables 2 and 3. In the case of natural field areas, the highest producer’s accuracy was obtained at a grid size of 6.55 m. This result was slightly different from that of classification probability. In particular, for a grid size of 4.36 m, the producer’s accuracy in building areas was significantly lower than that in other areas. In the classification results for different grid sizes, it is noteworthy that the deviation in producer’s accuracies was considerably larger than that in user’s accuracies.
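The user's, producer's, and overall accuracies in the error matrices follow directly from the cell counts. The sketch below recomputes them for the 13.1 m grid, using the counts reported in Table 2 and assuming that rows hold the deep learning classification and columns hold the land cover map reference. The function name `accuracy_report` is illustrative.

```python
from typing import List, Tuple


def accuracy_report(matrix: List[List[int]]) -> Tuple[List[float], List[float], float]:
    """Compute user's, producer's, and overall accuracy (in percent).

    matrix[i][j] is the number of cells classified as class i whose reference class is j.
    """
    n = len(matrix)
    row_sums = [sum(row) for row in matrix]
    col_sums = [sum(matrix[i][j] for i in range(n)) for j in range(n)]
    diagonal = [matrix[i][i] for i in range(n)]
    users = [100.0 * d / r for d, r in zip(diagonal, row_sums)]
    producers = [100.0 * d / c for d, c in zip(diagonal, col_sums)]
    overall = 100.0 * sum(diagonal) / sum(row_sums)
    return users, producers, overall


# Counts from Table 2 (13.1 m grid); class order: building, traffic, natural field.
TABLE_2 = [
    [325, 121, 33],
    [45, 227, 14],
    [17, 252, 261],
]

users, producers, overall = accuracy_report(TABLE_2)
for name, u, p in zip(("Building", "Traffic", "Natural field"), users, producers):
    print(f"{name}: user's {u:.3f}%, producer's {p:.3f}%")
print(f"Overall: {overall:.3f}%")
```

The recomputed values agree with the published percentages to within rounding in the last digit. User's accuracy is read along a row (how often a predicted class is right), whereas producer's accuracy is read down a column (how much of a reference class is recovered), which is why the two can diverge sharply for the traffic class.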

4.3 Integration of virtual grid sizes considering recommended training space size

After rasterizing the data for each grid size, reclassification was conducted to extract only the most relevant categories for each recommended grid size. For reclassification, the Maximum mosaic operator of the ArcGIS Mosaic tool was used to assign priority among overlapping pixels when merging the rasters. Further, as a preprocessing step, the data were reclassified and labeled as follows: villa/building = 2, car/road = 1, tree/forest = 3, and all remaining content = 0.

The criteria for assigning labels 1, 2, and 3 were based on the order of classification probability in the classification results for each grid size. Thus, natural areas were assigned the highest priority, followed by building and traffic areas.

The results showed that the integrated approach was more accurate than the single grid size approach (Table 5).
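The integration step can be sketched in plain Python as a per-pixel maximum over three label rasters. This is an illustration of the idea rather than the ArcGIS workflow itself: it assumes that the three rasters have already been resampled to a common pixel grid, and that each grid size contributes only its recommended category (13.1 m for buildings, 6.55 m for traffic, 4.36 m for natural areas). The function names and the 2 x 2 example rasters are hypothetical.

```python
from typing import List

Raster = List[List[int]]

# Labels from the text: car/road = 1, villa/building = 2, tree/forest = 3, other = 0.
OTHER, TRAFFIC, BUILDING, NATURAL = 0, 1, 2, 3


def keep_only(raster: Raster, label: int) -> Raster:
    """Keep pixels carrying `label`; set every other pixel to OTHER (0)."""
    return [[value if value == label else OTHER for value in row] for row in raster]


def mosaic_max(rasters: List[Raster]) -> Raster:
    """Merge rasters of identical shape by taking the per-pixel maximum label."""
    rows, cols = len(rasters[0]), len(rasters[0][0])
    return [[max(r[i][j] for r in rasters) for j in range(cols)] for i in range(rows)]


def integrate(coarse: Raster, medium: Raster, fine: Raster) -> Raster:
    """Combine the 13.1 m, 6.55 m, and 4.36 m results, one category per grid size."""
    return mosaic_max([
        keep_only(coarse, BUILDING),
        keep_only(medium, TRAFFIC),
        keep_only(fine, NATURAL),
    ])


# Hypothetical 2 x 2 rasters on a common pixel grid.
coarse = [[2, 2], [1, 3]]
medium = [[1, 2], [1, 1]]
fine = [[3, 1], [2, 3]]
merged = integrate(coarse, medium, fine)
```

Because the labels are ordered 3 > 2 > 1, the maximum rule gives natural areas priority over building areas, and building areas priority over traffic areas, wherever the retained categories overlap.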

Table 5. Classification error matrix for integrated approach (rows: Deep, the deep learning classification; columns: Land_use, the land cover map)

| Deep \ Land_use | Building Area | Traffic Area | Natural Field Area | Sum | User’s |
|---|---|---|---|---|---|
| Building Area | 2,629 | 1,019 | 195 | 3,843 | 68.410% |
| Traffic Area | 460 | 2,388 | 148 | 2,996 | 79.706% |
| Natural Field Area | 326 | 1,898 | 2,412 | 4,636 | 52.028% |
| Sum | 3,415 | 5,305 | 2,755 | 11,475 | |
| Producer’s | 76.984% | 45.014% | 87.550% | | 64.741% |

5. Conclusion and Future Work

Inception-based deep learning was applied to drone images, and virtual-grid-based SVM learning was used for object classification. Beyond conventional deep learning practice, training and classification were both performed on the given images, as in statistical classification. Results showed that the predicted classification probability and actual classification accuracy did not agree for a single size of the virtual grid. Classification accuracy was improved using an integrated approach with weighted virtual grid sizes for each class. However, the overall classification accuracy remained below 70%. A few characteristics of building areas were lost while preparing the initial deep learning data for a grid size of 4.36 m. Thus, a few building areas were misclassified as traffic areas.

As conventional methods employ images corresponding to a given area for training and classification, the spatial image range for learning was limited to the experimental area employed in this work. The classification accuracy obtained via the experiments was similar to that of conventional methods. This suggests that classification accuracy could be improved further if learning is performed using more images.

Further studies are required to determine the priorities corresponding to different virtual grid sizes and a method for expressing the classification results with a fixed grid size.

Acknowledgment

This work was partly supported by the Technology development Program of MSS G21S2759510 and the Startup Growth Technology Development Project G21S2759510

References

Song, A.R. (2019), A Novel Deep learning Framework for Multi-Class Change Detection of Hyperspectral Images, Ph.D. Dissertation, Seoul National University, 13p. (in Korean with English abstract)

Song, A.R. and Kim, Y.I. (2017), Deep Learning-based Hyperspectral Image Classification with Application to Environmental Geographic Information Systems, Korean Journal of Remote Sensing, Vol.33, No.6-2, pp.1061-1073. (in Korean with English abstract)

Joo, Y.D. (2017), Drone Image Classification based on Convolutional Neural Networks, The Journal of The Institute of Internet, Broadcasting and Communication (IIBC), Vol.17, No.5, pp.97-102. (in Korean with English abstract)

Jo, H.J. (2017), A Study on Image Acquisition and Image Processing using Drones, Master thesis, Kyungpook University, 83p. (in Korean with English abstract)

Bengio, Y. (2011), Deep learning of representations for unsupervised and transfer learning, Proceeding of ICML workshop on unsupervised and transfer learning-2011, 02 July, Bellevue, Washington, USA, Vol.27, pp.17-36.

Girshick, R., Donahue, J., Darrell, T. and Malik, J. (2014), Rich feature hierarchies for accurate object detection and semantic segmentation, IEEE conference on computer vision and pattern recognition-2014, 23-28 June, Columbus, OH, USA, pp.580-587.

Girshick, R. (2015), Fast R-CNN, IEEE International Conference on Computer Vision-2015, 07-13 Dec, Santiago, Chile, pp.1440-1448.

Kim, J.M., Hyeon, S.G., Chae, J.H. and Do, M.S. (2019), Road Crack Detection based on Object Detection Algorithm using Unmanned Aerial Vehicle Image, J. Korea Inst. Intell. Transp. Syst, Vol.18, No.6, pp.155-163. (in Korean with English abstract)

Kang, N.Y., Pak, J.G., Cho, G.S. and Yeu, Y. (2012), An Analysis of Land Cover Classification Methods Using IKONOS Satellite Image, Journal of Korean Society for Geospatial Information Science, Vol.20, No.3, pp.65-71. (in Korean with English abstract)

Ham, S.W. (2019), Semantic Segmentation of Drone Images Using Deep Learning - Focusing on Illegal Building Monitoring -, Master’s Thesis, University of Seoul, 17p. (in Korean with English abstract)

Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V. and Rabinovich, A. (2015), Going Deeper with Convolutions, The proceeding of CVPR: 28th IEEE Conference on Computer Vision and Pattern Recognition-2015, 7-12 June, Boston, MA, USA, pp.1-9.

Tang, Y. (2013), Deep Learning using Linear Support Vector Machines, International Conference on Machine Learning: Challenges in Representation Learning Workshop-2013.

Ren, S., He, K., Girshick, R. and Sun, J. (2015), Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks, Advances in Neural Information Processing Systems 28 Proceeding-2015, Dec, Montréal, Canada, pp.91-99.

Peng, C.-Y.J., Lee, K.L. and Ingersoll, G.M. (2002), An Introduction to Logistic Regression Analysis and Reporting, The Journal of Educational Research, Vol.96, No.1, pp.3-14.

Petrovska, B., Zdravevski, E., Lameski, P., Corizzo, R., Štajduhar, I. and Lerga, J. (2020), Deep Learning for Feature Extraction in Remote Sensing: A Case-Study of Aerial Scene Classification, Sensors, Vol.20, No.14, 3906; doi:10.3390/s20143906

Notley, S. and Magdon-Ismail, M. (2018), Examining the use of neural networks for feature extraction: A comparative analysis using deep learning, support vector machines, and k-nearest neighbor classifiers, arXiv preprint arXiv:1805.02294.
