
Deep Convolutional Neural Network를 이용한 주차장 차량 계수 시스템

Parking Lot Vehicle Counting Using a Deep Convolutional Neural Network

림 쿠이 송*․권 장 우**

* First author: Master's student, Dept. of Computer Engineering, Inha University

** Corresponding author: Professor, Dept. of Computer Engineering, Inha University

Kuoy Suong Lim*․Jang woo Kwon**

* Dept. of Computer Eng., Univ. of Inha

** Dept. of Computer Eng., Univ. of Inha

†Corresponding author : Jang woo Kwon, jwkwon@inha.ac.kr

Vol.17 No.5(2018) October, 2018 pp.173~187

SUMMARY

This paper proposes a computer vision and deep learning-based technique for a surveillance camera system that counts vehicles as one part of a parking lot management system. We applied the You Only Look Once version 2 (YOLOv2) detector and constructed a YOLOv2-based deep convolutional neural network (CNN) with a different architecture and two models. The effectiveness of the proposed architecture is demonstrated using Udacity's self-driving-car dataset. After training and testing, the proposed model achieved 64.30% mAP on the detection of cars, trucks, and pedestrians, compared to 47.89% mAP for the original architecture (YOLOv2), demonstrating improved detection accuracy.

Key words : Parking lot management, Object detection, Computer vision, Machine learning, Deep convolutional neural network, Surveillance camera.

ISSN 1738-0774(Print) ISSN 2384-1729(On-line) https://doi.org/10.12815/kits.2018.17.5.173

Received 18 September 2018 Revised 10 October 2018 Accepted 16 October 2018

ⓒ 2018. The Korea Institute of Intelligent Transport Systems. All rights reserved.

ABSTRACT

This paper proposes a computer vision and deep learning-based technique for a surveillance camera system for vehicle counting as one part of a parking lot management system. We applied the You Only Look Once version 2 (YOLOv2) detector and developed a deep convolutional neural network (CNN) based on YOLOv2 with a different architecture and two models. The effectiveness of the proposed architecture is illustrated using the publicly available Udacity self-driving-car dataset. After training and testing, our proposed architecture with the new models obtains 64.30% mean average precision, a better performance compared to the original architecture (YOLOv2), which achieved only 47.89% mean average precision on the detection of cars, trucks, and pedestrians.

Key words : Parking lot management, Object detection, Computer vision, Machine learning, Deep convolutional neural network, Surveillance camera.


Ⅰ. INTRODUCTION

There are an estimated 600,000 surveillance cameras in Tianjin, China, which together produce around 50 petabytes of data every day (Xiao et al., 2015). The high resolution of modern cameras and the large volume of video accumulated over long time spans put heavy pressure on data storage. For this reason, innovation in camera systems is needed if we are to find or detect an object within 50 petabytes of video data in a single day.

A huge number of CCTV camera systems still require human supervision. Dramatic efficiency gains could be achieved by embedding recent advances in computer vision and artificial intelligence into those camera systems so as to identify people and ensure public security, including criminal activity prevention and investigation, accident monitoring, protection of people, guarding of public property, and so on.

Parking management is needed across a wide variety of industries, including universities, entertainment venues, hospitals, airports, convention centers, and public office buildings. Hence, many systems (Hasegawa et al., 1994; Chen et al., 2010; Taghvaeeyan and Rajamani, 2014) exist to tackle the counting problem in these large public venues, yet introducing a new management system or installing an upgrade can incur significant expenses in servers, sensors, and network infrastructure. Additionally, the traditional sensors used to collect the data tend to be excessively costly or unreliable and require extensive design work.

By utilizing modern deep CNNs, we can depend entirely on the images captured by the camera. This provides several benefits. First, expensive sensors can be eliminated from the system. Second, removing sensors also removes the complexity of integrating different data sources and the associated costs in resource sharing, computing power, battery life, and weight. Finally, relying solely on cameras provides a great deal of installation flexibility. Installing a new camera counting system, moving a counting zone, or repointing the camera to define a new zone is roughly as simple as adjusting the software to suit the current camera view, whereas traditional in-ground sensors or beam-break devices generally must be uninstalled and then reinstalled at the new location. It is apparent that deep CNNs are an effective and efficient way to provide an accurate, cost- and energy-saving counting solution.

For this reason, in our study we use deep CNNs together with the proposed models to detect cars, trucks, and pedestrians, so that the system can be embedded in a surveillance camera covering a parking lot area.

Ⅱ. RELATED WORK

1. Traditional Approach

Traditionally, machine learning researchers have approached image classification and detection tasks by extracting features from images using global feature descriptors such as Local Binary Patterns (LBP) (Pietikäinen, 2010), Histograms of Oriented Gradients (HoG) (Dalal and Triggs, 2005), and color histograms, as well as local descriptors such as the scale-invariant feature transform (SIFT) (Lindeberg, 2012), speeded-up robust features (SURF) (Bay et al., 2006), and Oriented FAST and Rotated BRIEF (ORB) (Rublee et al., 2011). However, these are hand-crafted features that require domain expertise. Furthermore, the high variability of natural images in scale, illumination, rotation, deformation, occlusion, and viewpoint remains an obstacle, motivating research into new and better algorithms that can outperform those traditional approaches.

2. Deep CNN-based Object Detection

In recent years, deep learning has become best known for its ability to learn from experience, and is used in complex problems. Noticeably, deep convolutional neural networks (CNNs) have made tremendous progress in large-scale object recognition (He et al., 2016; Krizhevsky et al., 2012; Szegedy et al., 2015) and in detection problems (Ren et al., 2015; Liu et al., 2016; Redmon and Farhadi, 2017).

In the endeavor to accomplish a fully autonomous vehicle, a lot of computer science researchers have applied a deep CNN to extract information about the road and to understand the environment surrounding the vehicle, ranging from detecting pedestrians (Angelova et al., 2015), cars (Zhou et al., 2016), and bicycles, to detecting road signs (John et al., 2014) and obstacles (Hadsell et al., 2009).

CNNs have also been applied to the problem of counting objects in images. Onoro-Rubio and López-Sastre (2016) proposed a convolutional neural network model called the Counting CNN (CCNN) to estimate the number of vehicles in traffic congestion and to count people in very crowded scenes. The CCNN works as a regression model that learns to map the appearance of image patches to their corresponding object density maps. Zhang et al. (2015) also proposed a CNN architecture that predicts density maps for crowd counting via a switchable learning process.

Ⅲ. METHODOLOGY

1. YOLO Object Detection

YOLO, short for You Only Look Once (Redmon et al., 2016), is an object detector focused on real-time processing. It takes a different approach from networks that use region proposals or sliding windows; instead, it reframes object detection as a single regression problem. YOLO looks at the input image just once and divides it into an S x S grid of cells. Each grid cell predicts B bounding boxes together with a confidence score that combines the IOU with the ground-truth bounding box and the probability that the predicted box contains an object:

$\text{confidence} = \Pr(\text{Object}) \times \mathrm{IOU}^{\text{truth}}_{\text{pred}}$    (1)

Here $\mathrm{IOU}^{\text{truth}}_{\text{pred}}$ denotes the intersection over union between the predicted box and the ground truth. Each cell also predicts C conditional class probabilities, $\Pr(\text{Class}_i \mid \text{Object})$. Multiplying the box confidence by the class prediction yields one final score that tells us the probability that the bounding box contains a specific type of object.
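As a concrete illustration, the following minimal Python sketch (ours, not the paper's code) combines one cell's box confidences with its conditional class probabilities to obtain the class-specific scores described above; the numbers are arbitrary example values.

import numpy as np

# One grid cell: B = 2 predicted boxes, C = 3 classes (illustrative values only)
box_confidence = np.array([0.80, 0.10])            # Pr(Object) * IOU for each box
class_conditional = np.array([0.70, 0.20, 0.10])   # Pr(Class_i | Object)

# Class-specific confidence: Pr(Class_i | Object) * Pr(Object) * IOU
scores = box_confidence[:, None] * class_conditional[None, :]
print(scores.shape)   # (2, 3): one score per (box, class) pair
print(scores.max())   # 0.56 -> the best box/class combination for this cell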


2. YOLOv2 Architecture (S1)

YOLOv2 (Redmon and Farhadi, 2017) is the second version of YOLO, with significant improvements in accuracy and speed. We used two different model architectures in our training. The first experiment was based on the darknet architecture of YOLOv2 <see Table 1>. YOLOv2 has 31 layers in total, of which 23 are convolutional layers with a batch normalization layer before the leaky ReLU activation, and maxpool layers at the 1st, 3rd, 7th, 11th, and 17th layers. To train the network on our own dataset, we first reinitialize the final convolutional layer so that it outputs a tensor with shape 13 × 13 × 30, where 30 = 5 bounding boxes × (4 coordinates + 1 confidence value + 1 class probability).

Layer  Type             Filters  Size/Pad/Stride  Output
0      Convolutional    32       3 x 3 / 1 / 1    416 x 416
1      Maxpool                   2 x 2 / 0 / 2    208 x 208
2      Convolutional    64       3 x 3 / 1 / 1    208 x 208
3      Maxpool                   2 x 2 / 0 / 2    104 x 104
4      Convolutional    128      3 x 3 / 1 / 1    104 x 104
5      Convolutional    64       1 x 1 / 0 / 1    104 x 104
6      Convolutional    128      3 x 3 / 1 / 1    104 x 104
7      Maxpool                   2 x 2 / 0 / 2    52 x 52
8      Convolutional    256      3 x 3 / 1 / 1    52 x 52
9      Convolutional    128      1 x 1 / 0 / 1    52 x 52
10     Convolutional    256      3 x 3 / 1 / 1    52 x 52
11     Maxpool                   2 x 2 / 0 / 2    26 x 26
12     Convolutional    512      3 x 3 / 1 / 1    26 x 26
13     Convolutional    256      1 x 1 / 0 / 1    26 x 26
14     Convolutional    512      3 x 3 / 1 / 1    26 x 26
15     Convolutional    256      1 x 1 / 0 / 1    26 x 26
16     Convolutional    512      3 x 3 / 1 / 1    26 x 26
17     Maxpool                   2 x 2 / 0 / 2    13 x 13
18     Convolutional    1024     3 x 3 / 1 / 1    13 x 13
19     Convolutional    512      1 x 1 / 0 / 1    13 x 13
20     Convolutional    1024     3 x 3 / 1 / 1    13 x 13
21     Convolutional    512      1 x 1 / 0 / 1    13 x 13
22     Convolutional    1024     3 x 3 / 1 / 1    13 x 13
23     Convolutional    1024     3 x 3 / 1 / 1    13 x 13
24     Convolutional    1024     3 x 3 / 1 / 1    13 x 13
25     Route [16]       512                       26 x 26
26     Convolutional    64       1 x 1 / 0 / 1    26 x 26
27     Reorganize       256      2 x 2 / 0 / 2    13 x 13
28     Route [27] [24]  1280                      13 x 13
29     Convolutional    1024     3 x 3 / 1 / 1    13 x 13
30     Convolutional    30       1 x 1 / 0 / 1    13 x 13

<Table 1> S1's layer architecture. Table adapted from YOLO9000 (Redmon and Farhadi, 2017)
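As a quick check of the figures above, the following sketch (ours, not the authors' code) reproduces the shape of the reinitialized final layer's output from the numbers given in the text.

# Input resolution 416 x 416; five maxpool layers with stride 2 downsample by 2**5 = 32
grid = 416 // 32                 # 13
# Depth per the paper's breakdown: 5 anchors x (4 coordinates + 1 confidence + 1 class probability)
depth = 5 * (4 + 1 + 1)          # 30
print((grid, grid, depth))       # (13, 13, 30)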

3. Our Proposed Architecture (S2)

In an attempt to reduce the computational cost and model size of the neural network, we propose a second model (the S2 architecture), shown in <Table 3>. While requiring just 18 million parameters across 27 layers, S2 is a modified version of the S1 architecture, which needs roughly 48 million parameters. The final results show that the average precision and recall of the S2 architecture are similar to those of S1 while the model is smaller in size and faster in computation.

1) Anchor Box Model

Generally, in object detection techniques, only one object can be detected in each generated grid cell. A problem emerges when more than one object falls in a cell. The solution is the idea of anchor boxes: instead of predicting a single vector of length 5 + num_of_class, the cell predicts (5 + num_of_class) × num_of_anchorboxes values. Each anchor box is designed for detecting objects of different aspect ratios and sizes. For example, box 1 can detect objects that are large and tall, whereas box 2 detects objects with a small square shape, and so on.

YOLOv2 divides the entire image into a 13 x 13 grid and places 5 anchor boxes at each cell. The bounding box and class predictions are then made for each anchor box located there. The appropriate bounding box is the one with the highest intersection over union (IOU) between the ground-truth box and the anchor box.

A distinctive feature of YOLOv2 is that the anchor boxes can be designed specifically for the dataset it is trained on. The anchor boxes originally provided by YOLOv2 are for general objects in the Visual Object Classes (VOC) dataset, not for the shapes and sizes of "car", "truck", or "pedestrian". For this reason, we ran k-means clustering on our training set to generate five anchor boxes tailored to the objects in our dataset, following the standard k-means procedure:

Input: K (number of cluster centroids) and training set {x(1), x(2), ..., x(m)}
Randomly initialize K cluster centroids μ1, ..., μK
Repeat {
    for i = 1 to m
        c(i) := index (from 1 to K) of the cluster centroid closest to x(i)
    for k = 1 to K
        μk := average (mean) of the points assigned to cluster k
}

Although choosing a larger value of K (more centroids) gives a higher average IOU, it slows down the model, since more detectors are needed in each grid cell. YOLOv2 chooses 5 anchors as a good trade-off between recall and model complexity.

Anchor Box Set Width Height

Set 1 0.364515 0.565825

Set 2 0.829012 0.974748

Set 3 1.498756 1.903202

Set 4 2.727905 3.818290

Set 5 4.811523 6.621144

<Table 2> New anchor box sets generated by k-means clustering on the training dataset.
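A minimal sketch of how such anchors could be generated is given below, assuming (as in YOLOv2's dimension clusters) that the distance between a ground-truth box and a centroid is 1 - IOU computed on widths and heights alone; the paper does not spell out its exact distance metric, so this is an illustration rather than the authors' implementation. The five width/height pairs in <Table 2> would correspond to the centroids returned by such a procedure.

import numpy as np

def iou_wh(box, centroids):
    # IOU between one (w, h) box and each centroid, with all boxes aligned at the same corner
    inter = np.minimum(box[0], centroids[:, 0]) * np.minimum(box[1], centroids[:, 1])
    union = box[0] * box[1] + centroids[:, 0] * centroids[:, 1] - inter
    return inter / union

def kmeans_anchors(wh, k=5, iterations=100, seed=0):
    # wh: (m, 2) array of ground-truth box widths and heights (e.g. in grid-cell units)
    rng = np.random.default_rng(seed)
    centroids = wh[rng.choice(len(wh), size=k, replace=False)]
    for _ in range(iterations):
        # assignment step: nearest centroid under d = 1 - IOU
        assign = np.array([np.argmax(iou_wh(b, centroids)) for b in wh])
        # update step: each centroid becomes the mean of its assigned boxes
        centroids = np.array([wh[assign == j].mean(axis=0) if np.any(assign == j) else centroids[j]
                              for j in range(k)])
    return centroids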


<Fig. 1> Visualization of 5 predicted anchor boxes with respect to the grid cell

2) Denser Grid Model

The dataset we used has a large resolution of 1920 x 1200. The YOLOv2 network operates at a network resolution of 416 x 416, and after its convolutional layers downsample the image by a factor of 32, the output feature map (grid) is 13 x 13 <see Fig. 2>. Unlike the square input resolution of the base model (S1), our dataset images are large and wide. Because the original resolution is very high, we decided to use an input network resolution of 960 x 608, which is half of the original resolution and produces a 30 x 19 grid after downsampling <see Fig. 3>.

<Fig. 2> 13 x 13 grid cells of the output feature map <Fig. 3> 30 x 19 grid cells of the output feature map

3) S2-Anchor and Den-S2-Anchor

We made a few modifications to S1 to build the S2 architecture.

∙The 23rd and 24th layers of S1 consume more than 18 million parameters, and layer 29 alone requires more than 11 million. Hence, removing these three convolutional layers cuts about 30 million parameters (see the worked count after this list).

∙We make S2's 23rd layer have a filter depth of 2048 by modifying S1's 26th convolutional layer from 64 filters to 256 filters.


∙The reorganized 25th layer has a depth of 1024, obtained by reorganizing the 24th layer from 26 x 26 x 256 to 13 x 13 x 1024.

∙Route layer 26 concatenates the reorganized 25th layer (13 x 13 x 1024) with the 22nd convolutional layer (13 x 13 x 1024), giving an output of 13 x 13 x 2048.

∙We developed two different models for S2. The first uses the Anchor Box Model: we created S2-Anchor by modifying the widths and heights of S1's anchor boxes. By applying the Denser Grid Model on top of that, we created Den-S2-Anchor as the second model.
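A quick back-of-the-envelope check of the parameter savings claimed in the first bullet, assuming 3 x 3 kernels and the input/output depths listed in <Table 1> (batch-normalization and bias terms are ignored):

def conv_params(kernel, in_depth, out_depth):
    # weights of a convolutional layer: kernel * kernel * input depth * number of filters
    return kernel * kernel * in_depth * out_depth

layer23 = conv_params(3, 1024, 1024)   # S1 layer 23: ~9.4M parameters
layer24 = conv_params(3, 1024, 1024)   # S1 layer 24: ~9.4M parameters
layer29 = conv_params(3, 1280, 1024)   # S1 layer 29: ~11.8M parameters
print(layer23 + layer24)               # ~18.9M: "more than 18 million"
print(layer29)                         # ~11.8M: "more than 11 million"
print(layer23 + layer24 + layer29)     # ~30.7M: "about 30 million" removed in total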

Layer  Type             Filters  Size/Pad/Stride  Output
0      Convolutional    32       3 x 3 / 1 / 1    416 x 416
1      Maxpool                   2 x 2 / 0 / 2    208 x 208
2      Convolutional    64       3 x 3 / 1 / 1    208 x 208
3      Maxpool                   2 x 2 / 0 / 2    104 x 104
4      Convolutional    128      3 x 3 / 1 / 1    104 x 104
5      Convolutional    64       1 x 1 / 0 / 1    104 x 104
6      Convolutional    128      3 x 3 / 1 / 1    104 x 104
7      Maxpool                   2 x 2 / 0 / 2    52 x 52
8      Convolutional    256      3 x 3 / 1 / 1    52 x 52
9      Convolutional    128      1 x 1 / 0 / 1    52 x 52
10     Convolutional    256      3 x 3 / 1 / 1    52 x 52
11     Maxpool                   2 x 2 / 0 / 2    26 x 26
12     Convolutional    512      3 x 3 / 1 / 1    26 x 26
13     Convolutional    256      1 x 1 / 0 / 1    26 x 26
14     Convolutional    512      3 x 3 / 1 / 1    26 x 26
15     Convolutional    256      1 x 1 / 0 / 1    26 x 26
16     Convolutional    512      3 x 3 / 1 / 1    26 x 26
17     Maxpool                   2 x 2 / 0 / 2    13 x 13
18     Convolutional    1024     3 x 3 / 1 / 1    13 x 13
19     Convolutional    512      1 x 1 / 0 / 1    13 x 13
20     Convolutional    1024     3 x 3 / 1 / 1    13 x 13
21     Convolutional    512      1 x 1 / 0 / 1    13 x 13
22     Convolutional    1024     3 x 3 / 1 / 1    13 x 13
23     Route [16]       512                       26 x 26
24     Convolutional    256      1 x 1 / 0 / 1    26 x 26
25     Reorganize       1024     2 x 2 / 0 / 2    13 x 13
26     Route [25] [22]  2048                      13 x 13
27     Convolutional    30       1 x 1 / 0 / 1    13 x 13

<Table 3> S2's layer architecture. Table adapted from YOLO9000 (Redmon and Farhadi, 2017)

Ⅳ. EXPERIMENTS

1. Data Collection and Processing

We use the annotated driving dataset provided by Udacity's self-driving-car project (Udacity, 2018), which consists of driving in Mountain View, California, and neighboring cities during daylight conditions. It contains over 65,000 labels of cars, trucks, and pedestrians across 9,423 frames collected from a Point Grey research camera running at a full resolution of 1920 x 1200. For our study, we used 7,423 images for the training set and 1,000 images for the testing set. The training and testing sets are summarized below.

Dataset Car Labels Truck Labels Pedestrian Labels Total Labels Total Images

Training Set 46,446 3,051 3,593 53,090 7,423

Testing Set 7,699 390 1,691 9,780 1,000

<Table 4> Training and Testing Dataset Distribution
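The class shares implied by <Table 4> can be computed directly; the short sketch below (ours) does so, foreshadowing the class-imbalance discussion in the Results section.

# Label counts taken from <Table 4>
training = {"car": 46446, "truck": 3051, "pedestrian": 3593}
testing = {"car": 7699, "truck": 390, "pedestrian": 1691}

for name, counts in (("training", training), ("testing", testing)):
    total = sum(counts.values())
    shares = {cls: round(100.0 * n / total, 1) for cls, n in counts.items()}
    print(name, total, shares)   # "car" dominates both splits (about 87.5% of training labels)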

2. Training

Our implementation of the S1, S2-Anchor, and Den-S2-Anchor architectures is based on an open-source YOLO framework called Darkflow (Thtrieu, 2016). We trained all three models from weights pre-trained on PASCAL VOC and/or Microsoft Common Objects in Context (MS COCO), because weights pre-trained for general-purpose object detection have already learned low- and mid-level features such as corners, edges, colors, and shapes in a hierarchical fashion.

Training was performed on a GeForce GTX 1080 with 10 GB of RAM. In all training sessions, we used the Adam optimizer because of its tendency toward fast convergence.

Training the S1 model started with a learning rate of 1e-5 to reduce the loss quickly. After training for 30 epochs, we validated the model on test images it had never seen; performance was not good, with a number of false-positive and false-negative bounding boxes. For this reason, we lowered the learning rate to 1e-6 for another 30 epochs to ensure finer granularity and proper convergence of our models.

Like S1, we trained the S2-Anchor and Den-S2-Anchor models at the same learning rates of 1e-5 and 1e-6 for 30 epochs each. During validation, we trained Den-S2-Anchor for another 30 epochs at the 1e-6 learning rate; however, the model's performance got slightly worse from overfitting, so we reverted to the previous checkpoint. Training the three models took about four days.
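For reference, below is a hedged sketch of how such a training run might be configured with Darkflow's Python interface; the option names follow the thtrieu/darkflow README but may differ between versions, and the .cfg and annotation paths are placeholders rather than the authors' actual files.

from darkflow.net.build import TFNet

options = {
    "model": "cfg/den-s2-anchor.cfg",   # hypothetical cfg file describing the proposed architecture
    "load": "bin/yolov2.weights",       # pre-trained weights (VOC / MS COCO), as described above
    "train": True,
    "trainer": "adam",                  # Adam optimizer
    "lr": 1e-5,                         # lowered to 1e-6 for the second 30 epochs
    "epoch": 30,
    "batch": 16,                        # assumed batch size; not stated in the paper
    "annotation": "data/annotations/",  # Udacity labels converted to VOC-style XML (assumption)
    "dataset": "data/images/",
    "gpu": 1.0,
}
TFNet(options).train()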

<Fig. 4-6> show all three models’ training loss.

<Fig. 4> S1 model’s loss value of training for 60 epochs


<Fig. 5> S2-Anchor model’s loss value for 60 epochs

<Fig. 6> Den-S2-Anchor model’s loss value for 60 epochs

3. Parking Lot Environment Simulation

The parking lot environment <see Fig. 7> was built using Lego bricks, motors, and a Lego Mindstorms EV3 brick to power the conveyor belt. We chose Lego Mindstorms EV3 because it makes building this kind of simulation environment simple, fast, and robust, and programming and commanding it is relatively efficient (Valk, 2014).

As for the detection technique, we used ArUco markers to verify whether an object (car, truck, or pedestrian) has entered the area to be counted. ArUco (OpenCV, 2018) is a library for camera pose estimation using squared markers composed of a wide black border and an inner binary matrix that determines the marker's identifier (id). ArUco detection is extremely fast because the black border facilitates detection in the image. Since cars and trucks are wide, we used two different ArUco markers mounted on two Lego poles directly in front of the camera, with the conveyor belt carrying the detected objects running in between.

The camera was programmed to detect the ArUco markers in real time, so whenever an object fully blocks the two markers, the proposed model embedded in the camera fires to detect the object and starts counting if the object is a car, truck, or pedestrian.


However, to guarantee an accurate count of the available parking spaces, the counting system is programmed to count only those vehicles that can potentially take a parking spot (cars, trucks) and to ignore objects that may pass through the counting area but cannot occupy a parking space (people, animals, bicycles, motorcycles, and so on). A sketch of this trigger-and-count logic is shown below.
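The following minimal sketch illustrates the trigger-and-count logic, assuming the OpenCV 3.x-era cv2.aruco API (function names changed in later OpenCV releases) and hypothetical marker ids; detect_objects() stands in for inference with the trained model and is not a real API.

import cv2

MARKER_IDS = {0, 1}            # hypothetical ids of the two mounted ArUco markers
COUNTABLE = {"car", "truck"}   # pedestrians and other objects pass through uncounted

dictionary = cv2.aruco.Dictionary_get(cv2.aruco.DICT_4X4_50)
parameters = cv2.aruco.DetectorParameters_create()

def markers_blocked(gray_frame):
    # True when neither mounted marker is visible, i.e. an object fully blocks both of them
    corners, ids, _ = cv2.aruco.detectMarkers(gray_frame, dictionary, parameters=parameters)
    visible = set(ids.flatten()) if ids is not None else set()
    return MARKER_IDS.isdisjoint(visible)

def update_count(gray_frame, detect_objects, count):
    # detect_objects(frame) -> list of (label, confidence); placeholder for the trained detector
    if markers_blocked(gray_frame):
        for label, _confidence in detect_objects(gray_frame):
            if label in COUNTABLE:
                count += 1
    return count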

<Fig. 7> Parking lot environment simulation using Lego bricks, motors, and Lego Mindstorms EV3

Ⅴ. RESULTS

We evaluated the performance of the networks using precision and recall scores, computed with the following formulas:

$\text{Precision} = \frac{1}{n}\sum_{i=1}^{n} \frac{TP_i}{TP_i + FP_i}$    (2)

$\text{Recall} = \frac{1}{n}\sum_{i=1}^{n} \frac{TP_i}{TP_i + FN_i}$    (3)

TP, FP, and FN denote true positives, false positives, and false negatives, respectively, and n represents the total number of testing images. In the testing phase, we tested each model against 1,000 images and computed the average precision for all the models.
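A minimal sketch of equations (2) and (3) in code, assuming per-image true positive, false positive, and false negative counts are already available from matching detections to ground truth:

def average_precision_recall(per_image_counts):
    # per_image_counts: list of (TP, FP, FN) tuples, one per test image
    n = len(per_image_counts)
    precision = sum(tp / (tp + fp) if tp + fp else 0.0 for tp, fp, _ in per_image_counts) / n
    recall = sum(tp / (tp + fn) if tp + fn else 0.0 for tp, _, fn in per_image_counts) / n
    return precision, recall

# Example with two test images
print(average_precision_recall([(8, 2, 1), (5, 0, 3)]))   # (0.9, ~0.757)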

Model                 Car   Truck  Pedestrian  Mean Average Precision  Parameters  Frames Per Second
S1 (YOLOv2)           0.76  0.33   0.35        47.89%                  48 million   23
S2-Anchor (Ours)      0.74  0.33   0.31        45.79%                  18 million   32
Den-S2-Anchor (Ours)  0.82  0.44   0.66        64.30%                  18 million   21

<Table 5> Results of the S1, S2-Anchor, and Den-S2-Anchor models

Based on the results table, the Den-S2-Anchor model produced the highest mean average precision across all three object classes. Although S1 (YOLOv2) is a promising model, it does not perform as well on real-life, high-resolution images. S2-Anchor, on the other hand, obtained a mean average precision similar to the base model (S1) with far fewer parameters. S2-Anchor processed images the fastest, at 32 FPS (about 0.031 s per image); S1 came second at 23 FPS (about 0.043 s), while Den-S2-Anchor came last at 21 FPS (about 0.048 s).

The results show that, despite a small compromise on speed, our proposed Den-S2-Anchor model achieves better performance with a large reduction in computational complexity and model size.

There are a couple of things to examine in these results. First, although S2-Anchor has far fewer parameters and layers than S1, it still produces comparable precision. This is due to the use of pre-trained weights in the training phase: the models were able to learn the low- and mid-level features well. Second, "car" shows higher accuracy than "truck" and "pedestrian" because our dataset has imbalanced object instances. In the training set, cars, trucks, and pedestrians account for 87.5%, 5.8%, and 6.7% of the object instances, respectively, and the same pattern appears in the testing set (78.7% cars, 4% trucks, and 17.3% pedestrians). This imbalance likely pushes the system toward higher accuracy on the "car" class.


<Fig. 8> Examples of the detection of car, truck, and pedestrian objects. The model is able to detect the objects well under extensively varying real-world situations, including occlusion, close-ups, and varying conditions of strong light spots and changing shadows.


<Fig. 9> Examples of the parking lot simulation environment for the counting system. The three object instances were detected, but only "car" and "truck" were counted; the "pedestrian" object was ignored because it will not take up a parking space.

Ⅵ. CONCLUSION AND FUTURE WORK

We presented a deep-learning architecture designed to detect cars, trucks, and pedestrians for use in real-life applications, and we built a simulated parking-lot environment for the experiment. The proposed models shrink the number of parameters by a large margin, with a notable increase in performance and speed. We demonstrated that deep-learning-based vehicle counting is accurate, intelligent, and easier to deploy than the traditionally used sensors. In spite of this study's limitations, our work can serve as useful groundwork for the further development of parking lot management.

In terms of future work, we would first address the imbalance of object instances across classes in the dataset. Second, we could train the system to count vehicles by other attributes, such as vehicle type and brand, if such a dataset becomes available. We could also improve our model with other architectural approaches and apply it to other vehicle datasets with varying levels of real-life difficulty. Finally, the same camera used for counting could also serve as a video feed for the security team, upgrading the overall parking lot management system.


ACKNOWLEDGEMENTS

"This research was supported by the MSIT (Ministry of Science and ICT), Korea, under the ITRC (Information Technology Research Center) support program (IITP-2017-0-01642) supervised by the IITP(Institute for Information

& communications Technology Promotion)"

"This research was supported by the MSIT (Ministry of Science, ICT), Korea, under the ITRC (Information Technology Research Center) support program (IITP-2018-2014-1-00729) supervised by the IITP(Institute for Information & communications Technology Promotion)"

REFERENCES

Angelova A., Krizhevsky A., Vanhoucke V., Ogale A. S. and Ferguson D.(2015, September), "Real-Time Pedestrian Detection with Deep Network Cascades," In BMVC, vol. 2, p.4.

Bay H., Tuytelaars T. and Van Gool L.(2006), “Surf: Speeded up robust features,” In European conference on computer vision, Springer, Berlin, Heidelberg, pp.404-417.

Chen L. C., Hsieh J. W., Lai W. R., Wu C. X. and Chen S. Y.(2010, October), “Vision-based vehicle surveillance and parking lot management using multiple cameras,” In 2010 Sixth International Conference on Intelligent Information Hiding and Multimedia Signal Processing, IEEE, pp.631-634.

Dalal N. and Triggs B.(2005), “Histograms of oriented gradients for human detection,” In Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference, IEEE, vol. 1, pp.886-893.

Hadsell R., Sermanet P., Ben J., Erkan A., Scoffier M., Kavukcuoglu K. and LeCun Y.(2009), "Learning long-range vision for autonomous off-road driving," Journal of Field Robotics, vol. 26, no. 2, pp.120-144.

Hasegawa T., Nohsoh K. and Ozawa S.(1994), "Counting cars by tracking moving objects in the outdoor parking lot," In Vehicle Navigation and Information Systems Conference, 1994. Proceedings, IEEE, pp.63-68.

He K., Zhang X., Ren S. and Sun, J.(2016), “Deep residual learning for image recognition,” In Proceedings of the IEEE conference on computer vision and pattern recognition, pp.770-778.

John V., Yoneda K., Qi B., Liu Z. and Mita S.(2014, October), “Traffic light recognition in varying illumination using deep learning and saliency map,” In Intelligent Transportation Systems (ITSC), 2014 IEEE 17th International Conference, IEEE, pp.2286-2291.

Krizhevsky A., Sutskever I. and Hinton G. E.(2012), “Imagenet classification with deep convolutional neural networks,” In Advances in neural information processing systems, pp.1097-1105.

Lindeberg T.(2012), Scale invariant feature transform.

Liu W., Anguelov D., Erhan D., Szegedy C., Reed S., Fu C. Y. and Berg A. C.(2016, October), "SSD: Single shot multibox detector," In European conference on computer vision, Springer, Cham, pp.21-37.

Onoro-Rubio D. and López-Sastre R. J.(2016), “Towards perspective-free object counting with deep learning,” In European Conference on Computer Vision, Springer, Cham, pp.615-629.

OpenCV(2018), "Detection of ArUco Markers," Retrieved September 11, 2018, from https://docs.opencv.org/3.1.0/d5/dae/tutorial_aruco_detection.html.

Pietikäinen M.(2010), “Local binary patterns,” Scholarpedia, vol. 5, no. 3, p.9775.

Redmon J. and Farhadi A.(2017), "YOLO9000: Better, faster, stronger," arXiv preprint.

Thtrieu(2016), darkflow, https://github.com/thtrieu/darkflow.

Redmon J., Divvala S., Girshick R. and Farhadi A.(2016), “You only look once: Unified, real-time object detection,” In Proceedings of the IEEE conference on computer vision and pattern recognition, pp.779-788.

Ren S., He K., Girshick R. and Sun J.(2015), “Faster r-cnn: Towards real-time object detection with region proposal networks,” In Advances in neural information processing systems, pp.91-99.

Rublee E., Rabaud V., Konolige K. and Bradski G.(2011, November), “ORB: An efficient alternative to SIFT or SURF,” In Computer Vision (ICCV), 2011 IEEE international conference, IEEE, pp.2564-2571.

Szegedy C., Liu W., Jia Y., Sermanet P., Reed S., Anguelov D. and Rabinovich A.(2015), “Going deeper with convolutions,” In Proceedings of the IEEE conference on computer vision and pattern recognition, pp.1-9.

Taghvaeeyan S. and Rajamani R.(2014), “Portable roadside sensors for vehicle counting, classification, and speed measurement,” IEEE Transactions on Intelligent Transportation Systems, vol. 15, no. 1, pp.73-83.

Udacity(2018, April 18), Udacity/self-driving-car, Retrieved September 11, 2018, from https://github.com/udacity/self-driving-car/tree/master/annotations.

Valk L.(2014), Lego mindstorms Ev3 Discovery Book: A beginner's guide to building and programming robots, No Starch Press.

Xiao J., Liao L., Hu J., Chen Y. and Hu R.(2015), “Exploiting global redundancy in big surveillance video data for efficient coding,” Cluster Computing, vol. 18, no. 2, pp.531–540.

Zhang C., Li H., Wang X. and Yang X.(2015), “Cross-scene crowd counting via deep convolutional neural networks,” In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp.833-841.

Zhou Y., Nejati H., Do T. T., Cheung N. M. and Cheah L.(2016, October), “Image-based vehicle analysis using deep neural network: A systematic study,” In Digital Signal Processing (DSP), 2016 IEEE International Conference, IEEE, pp.276-280.
