Real-time Human Detection under Omni-dir ectional Camera based on CNN with Unified Detection and AGMM for Visual Surveillance

(1)

1. INTRODUCTION

Since omni-directional cameras can provide a very wide field of view (up to 360° horizontally, 180° vertically) about scenes, they are popularly deployed in application areas like visual surveil- lance, where human detection is usefully demanded.

Human detection has a wide range of applications such as pedestrian detection for car driving safety, people counting for commercial marketing, human interaction in robotics, and so on in addition to hu- man intrusion detection in visual surveillance.

Thus, during the last two decades, huge intensive

research efforts have been poured into image- based human detection.

As opposed to researches on human detection under perspective cameras, where remarkable ach- ievements [1-9] have been accomplished during the last decade, those under omni-directional (fisheye) cameras [10-11] is far from satisfaction.

Exquisite human model like deformable part model [2] and deformation dictionaries [7] as well as most of effective hand-crafted feature vectors for human detection under perspective images like HoG [1, 8, 9] and Integral channel [3] cannot be directly applied to omni-images since they are se-

Real-time Human Detection under Omni-directional Camera based on CNN with Unified Detection

and AGMM for Visual Surveillance

Thanh Binh Nguyen

^†

, Van Tuan Nguyen

^†

, Sun-Tae Chung

^††

, Seongwon Cho

^†††

ABSTRACT

In this paper, we propose a new real-time human detection under omni-directional cameras for visual surveillance purpose, based on CNN with unified detection and AGMM. Compared to CNN-based state-of-the-art object detection methods. YOLO model-based object detection method boasts of very fast object detection, but with less accuracy. The proposed method adapts the unified detecting CNN of YOLO model so as to be intensified by the additional foreground contextual information obtained from pre-stage AGMM. Increased computational time incurred by additional AGMM processing is compensated by speed-up gain obtained from utilizing 2-D input data consisting of grey-level image data and foreground context information instead of 3-D color input data. Through various experiments, it is shown that the proposed method performs better with respect to accuracy and more robust to environment changes than YOLO model-based human detection method, but with the similar processing speeds to that of YOLO model-based one. Thus, it can be successfully employed for embedded surveillance application.

Key words: Human Detection; Omni-Directional Camera; CNN (Convolutional Neural Networks);

AGMM (Adaptive Gaussian Mixture Model); Visual Surveillance

※ Corresponding Author : Sun-Tae Chung, Address:

(06978) Dept. of Smart Systems Software, Soongsil Univ., 369, Sangdo-Ro, Dongjak-Gu, Seoul, Korea, TEL : +82- 2-820-0638, FAX : +82-2-821-7653, E-mail : cst@ssu.

ac.kr

Receipt date : May 9, 2016, Approval date : Aug. 9, 2016

†††

Dept. of Information and Telecommunication Engin- eering, Soongsil University

(E-mail : [email protected])

†††

Dept. of Smart Systems Software, Soongsil Uniersity

†††

School of Electronic and Electrical Engineering, Hongik University (E-mail : [email protected])

※ This work was supported by the Soongsil University

Research Fund

(2)

verely distorted from perspective images.

Now, CNN(Convolutional Neural Network) is now well-recognized as an excellent mechanism to work well for image-base application tasks since it can be structured to learn automatically models as well as low-level and high level features appro- priate for application tasks. Thus, recently, to han- dle the limited representation of hand-crafted fea- tures and the limited capability of the previous hu- man models to capture large variations and occlu- sions of human appearance, For object (including human) detection under perspective cameras, CNN-based approaches start to be proposed [12- 20]. Among them, YOLO model-based one pre- sented in [20] is noticeable for its extremely faster processing speed compared to other CNN-based methods in [12-19]. This is possible since YOLO model trains CNN in a unified way so that it can detect objects in a frame simultaneously in testing.

However, it is analyzed [20] that YOLO mod- el-based object (human) detection shows less background error (false positive) but more local- ization error (false negative) compared to the state- of-the-art CNN-based object detection methods like fast R-CNN [17]. Even though it is argued that YOLO model performs less background errors than fast R-CNN since YOLO model learns con- textual information about objects [20], it is ob- served through our experiments that YOLO model is not robust enough to environment changes in- cluding background scene changes unless YOLO model neural network is trained for various envi- ronments with different background scenes.

YOLO model’s less accurate localization may come from the fact that its CNN network cannot learn more expressive representation about objects from input training images’s raw pixel information and reference information about object bounding boxes and confidence scores for the bounding box- es for the input training images.

In this paper, we concentrate on developing fast but accurate human detection under omni-direc-

tional cameras in the specific application domain of visual surveillance. One good thing about human detection for a specific application domain like vis- ual surveillance is that one can utilize the applica- tion domain prior knowledge [9]. In visual surveil- lance for designated areas, the background scene is fixed as opposed to image-based driver assis- tance systems where background scene is chang- ing. Humans are detected inside the foreground object areas which can be extracted as the mini- mum bounding boxes of foreground masks. Fore- ground masks under fixed backgrounds are usually extracted by background subtraction methods, among which AGMM (Adaptive Gaussian Mixture Model)-based background subtraction is well known to work successfully even under complex background scenes. The bounding boxes of fore- ground masks are highly plausible candidate re- gions for humans, and thus one can utilize the fore- ground bounding boxes as contextual information about foreground objects.

In this paper, we propose a new real-time human detection under omni-directional cameras for sur- veillance purpose based on CNN which supports a unified detection like YOLO model, but is in- tensified further about localization of object bound- ing boxes and confidence scores for the bounding boxes by foreground object context information extracted by AGMM.

The foreground contextual information provided

by the pre-stage AGMM process contributes to

training CNN network to learn object (human)

areas more confidently and more precisely so that

in testing, the trained CNN network performs more

accurate with respect to human detection. Also, the

false foreground object area information extracted

by AGMM is used to train the CNN network to

learn about the wrong foreground areas. Thus, in

testing, the trained CNN network will work toward

filtering out wrong human detection (false pos-

itive). Moreover, the contextual information about

foreground object areas turns out to train the CNN

(3)

network so as to be more robust to environment changes including background scene changes.

To learn contextual information for various backgrounds, CNN with a unified detection of YOLO model needs to be extensively trained against various backgrounds. If not, it turns out that YOLO model trained under one background does not perform well under other background since YOLO model trained under limited image da- ta set,, but not under extensive image data set would learn contextual information biased toward backgrounds in the limited image data set. When YOLO-based human detection for visual surveil- lance is considered for commercial deployment, collecting various surveillance backgrounds for training will be very costly. The proposed method in this paper compensates the bias of YOLO model twisted toward backgrounds in the trained dataset by letting the CNN network learn more weight on foreground object area and less weight on back- ground area during training. Less weight on back- ground area helps the CNN network in the pro- posed method less affected by different back- grounds. Different weight assignments are guided by foreground contextual information provided by AGMM.

Increased computational time incurred by addi- tional AGMM processing is compensated by speed-up gain obtained from utilizing 2-D input data for the unified CNN network, which consists of grey-level image data and foreground /back- ground context information data instead of original 3-D color input data. Further speed-up can be ach- ieved by realization that the CNN network in- tensified by foreground contextual information can produce the similar accuracy performance even with less layers since each convolutional layer can learn more efficiently.

The experimental results show that the pro- posed human detection method significantly im- proves accuracy and robustness to background changes without deterioration of processing speed

compared to YOLO model-based human detection [20], which makes the proposed human detection method suitable for real-time human detection in embedded surveillance systems.

In passing, it is worthwhile to remark that con- cerning contextual or semantic information learn- ing, the approach proposed in this paper is different from those in [13-15]. As opposed to the ap- proaches in [13-15] which construct new CNN network architectures with input image pixel data only, the CNN network in this paper has additional contextual information input in addition to input image pixel data without restructuring the CNN network framed in a unified regression way as in [19-20]. The previous researches about CNN-based object (human) detection [12∼20] are more con- cerned about how to train input image data so as to make CNN learn information about object loca- tion, object class and context for objects efficiently.

In this proposed method, it is shown that what data to train for CNN is also as important as how to train data for CNN.

The rest of the paper is organized as follows.

Section 2 introduces technical backgrounds and re- lated work necessary for understanding the works of this paper. Section 3 describes our proposed hu- man detection method. Experimental results are discussed in Section 4, and finally the conclusion is presented in Section 5.

2. TECHNICAL BACKGROUNDS AND RELATED WORK

2.1 AGMM (Adaptive Gaussian Mixture Model)- based Background Subtraction

The rationale in “background subtraction” for

detecting foreground objects in videos from static

cameras (the background scene of camera is fixed)

is to detect the foreground objects from the differ-

ence between the current frame and a reference

frame, often called “background model”. AGMM

(Adaptive Gaussian Mixture Model) [21] is a stat-

(4)

istical background modeling which is well-known to be effective for extracting foreground objects in complex background scenes. For the detailed AGMM-based foreground object extraction algo- rithm, readers are recommended to refer to [21].

In AGMM-based foreground object detection on background subtraction methods, each pixel in the scene is modeled by a mixture of K Gaussian dis- tributions which reflects variation of scenes and illuminations. But, background subtraction meth- ods are still sensitive to lighting variations and scene clutters, and have difficulty in handling the grouping and fragmentation problems, and more- over foreground object candidate areas extracted from background subtraction methods are not guaranteed to contain humans. And therefore, con- volutional neural networks are successfully utilized to learn about all variations and to verify whether the foreground object candidate areas extracted by AGMM contain human or not.

Background subtraction methods based on AGMM contains two significant parameters, the learning rate constant ^ in the Gaussian statistics update and the background threshold ^



, a meas- ure of the minimum portion of the data that should be accounted for by the background. When the val- ue of two parameters is changed, it will affect false negative (missing detection) and false positive (false detection). This will be mentioned in detail on Section 4, Experimental Results.

2.2 CNNs (Convolutional Neural Networks) Convolutional Neural Networks (also called as CNNs or ConvNets) is a kind of deep learning net- works which can automatically learn models as well as low-level and high level features appro- priate for application tasks. Thus, it has been suc- cessfully utilized in many commercial applications from image classification, object detection to scene labelling [22].

Although the basic concepts of CNN are known since 1980 [23], they did not receive a lot of atten-

tion until the last few years [24-25]. Due to theo- retical advances, the increased availability of com- putational resources and larger amounts of data and excellent performance, CNN-based approaches are now considered state-of-the-art in many vi- sion tasks including human detection.

As well-known now, networks of CNN have three essential layers; convolutional layer for learning features, pooling layer for lowering com- putational burden by reducing the number of con- nections between convolutional layers, and fully connected layer for classification. Recently, pooling layer of CNN is restructured to speed up process- ing by utilizing spatial pyramid pooling as in [26]

or to exploit the spatial relation of semantic fea- tures by spatially weighted max pooling [15].

2.3 CNN with Unified Detection [20]

YOLO model [20] is a CNN model with unified detection which can detect multiple objects in a image simultaneously through a single neural net- work in a regression formulation. It divides the im- age into a S x S grid and simultaneously predicts bounding boxes of objects, confidence in those boxes, and class probabilities. If the center of an object falls into a grid cell, that grid cell is respon- sible for detecting that object. Each grid cell pre- dicts B bounding boxes and confidence scores for those boxes. Confidence score is defined to equal the intersection over union (IOU) between the pre- dicted box and the ground truth. These confidence scores reflect how confident the model is about that the box contains an object. Each bounding box consists of 5 predictions: x, y, w, h, and confidence score. The (x, y) coordinates represent the center of the box relative to the bounds of the grid cell.

The width, w and height, h are predicted relative

to the whole image. Each grid cell also predicts C

conditional class probabilities. The YOLO model

only predicts one set of class probabilities per grid

cell, regardless of the number of boxes B. Thus,

these predictions are encoded as an S × S × (B

(5)

× 5 + C) tensors where S × S is the size of selected grid, B is number of bounding box for each cell, C is the number of object classes.

About how the network of YOLO model is con- structed and how it is trained, readers are recom- mend to refer to [20].

2.4 Related Work

During the last two decades, huge intensive re- search efforts have been poured into on human de- tection from image analysis since it has a wide range of applications such as intrusion detection in visual surveillance, pedestrian detection for car driving safety, people counting for commercial marketing, human interaction in robotics, and so on.

Many noticeable human detection methods pro- posed and tested during the last decades [1-9] are formulated as classification problem, where classi- fiers are trained using carefully hand-crafted fea- ture vectors such as HoG [1] and channel features [3]. To account for view and pose variations of hu- mans, combined features [3] or exquisite human models like deformable part model [2] or several approaches [6∼7] have been tried. However, the representation of hand-crafted features cannot be optimized for pedestrian detection, and neither can any exquisite model so far to capture large varia- tions of human appearance and occlusions of hu- man bodies.

To overcome these problems, deep learning- based approaches have attracted attention. Deep learning, especially in CNN, can learn features and models from raw pixels to improve the perform- ance of human detection. In [12], CNN is con- structed so as to branch lower levels’ outputs into the top classifier to learn multi-stage features like both global shapes and structures and local details, such as a global silhouette and face components in the case of human detection. The Joint Deep learning model [13] formulates pedestrian detection in joint part-based model and designs CNN to learn

pedestrian in 20 parts model and estimates pedes- trian from scores of 20 parts. [12] and [13] for- mulate pedestrian detection as a single binary clas- sification task (pedestrian or not pedestrian) and thus they may confuse positive (pedestrian) with hard negative samples (not pedestrians but pedes- trian-like object). In order to handle this ambiguity, TA-CNN [14] constructs CNN so as to learn dis- criminative representation including semantic con- texts for pedestrian detection by jointly optimizing it with semantic attributes including pedestrian at- tributes (e.g. ‘carrying backpack’) and scene at- tributes (e.g. ‘vehicle’, ‘tree’, and ‘horizontal’). Then, learning semantic contexts helps to discriminate between pedestrians and pedestrian-like objects during running and shows better accuracy per- formance than state-of-the-art methods do. How- ever, TA-CNN [14] should learn 9 pedestrian at- tributes, and 8 scene attributes in addition to binary classification information so that it has computa- tional burden, which is disadvantageous for em- bedded implementation.

[15] proposes a new module to learn spatially weighted max pooling module in CNN to jointly model the spatial structure and appearance of high-level semantic features, and applies the new CNN model to pedestrian detection for evaluation.

Experiments through several pedestrian detection databases shows that it performs better than tradi- tional CNN.

All of [12-15] are to localize pedestrian by slid- ing window approach, which requires to scan mul- ti-scale image windows all over the whole image frame. It is well-known that the sliding window approach costs computationally high [9, 16]. To improve processing speed, selective search ap- proaches like region-CNN [16], Fast R-CNN [17]

apply CNN only over region proposals extracted from the pre-stage processing, but not over the whole image frame.

All these prior works [12-17] formulate object

detection as applying classifiers to each image re-

(6)

gion whether it is a sliding window or a selective region so that they still have to apply the heavy computing CNN many times before finishing.

However, all these prior approaches are still heavy for real-time processing for embedded systems.

[18-20] reformulate the object detection as a re- gression problem and train CNN to learn the effi- cient regression algorithm between predicted bounding boxes and ground truth bounding boxes of the objects. In addition to the coordinates and box sizes of the matching bounding boxes, how much the matching is achieved is designated by the confidence score which represents how much the matching is achieved. Deep MultiBox [18]

trains a convolutional neural network in a re- gression to predict regions of interest. [19] is more concerned about finding a single graspable area for an object. It does not need to know anything about the object extent or its center. On the other hand, YOLO model proposed in [20] involves joint opti- mization of the classification and localization error, unlike [19]. YOLO model [20] shows much faster processing compared to other CNN-based state-of the-art object detection methods by framing re- gression formulation of object detection in a unified way in the sense that YOLO model trains CNN to learn to predict a distribution over class labels as well as a bounding box of the object in each grid cell simultaneously over 7 × 7 grid of the input image. Certainly YOLO model gains significant computational advantage over region proposals [16-17] in a detecting a few classes as in human detection. Thus, YOLO model-based method can be considered as a very promising CNN candidate architecture for object detection in embedded sys- tem without any powerful GPU or special vision DSP cores.

Due to large field of view, omni-directional lens cameras are popularly used for various applica- tions, such as visual surveillance. For surveillance purpose, human detection under omni-directional cameras have been actively studied during past

decades. Mainly two approaches have been pro- posed in the literature; 1) application of successful techniques developed under perspective setting and 2) different techniques directly applied for om- ni-images. A well-known technique in perspective environments can be applied to human detection under fisheye cameras [27-28]. But, persons sitting or lying down or standing around the center of the omni-images do not appear like persons of posture well-detected under perspective scenes even after the omni-image is converted into a perspective or a panoramic image. Also, converting the omni-im- age into a perspective or a panoramic image costs additional computational time. Thus, many differ- ent direct approaches have been proposed; mod- ification of HoG feature so as to be appropriate for omni environments [10] or formulation of human detection as Bayesian MAP framework utilizing a shape-based detector [11]. These methods only consider pedestrians (walking person) appeared at off-center regions and do not show superior performance.

Likewise in the perspective setting, to handle difficulty of extracting hand-crafted features and human models appropriate under omni-image set- ting, CNN approaches start to appear, but a very few. [29] applies deep CNN in classifier formulation to human detection under omni-images and reports better accuracy performance compared to that of one conventional popular method, combination of HoG and Adaboost through their home-grown dataset. However, [29] does not explain how to process for multiple human detection in an image frame, and it does not provide any comparison to other methods about processing speed.

3. PROPOSED HUMAN DETECTION METHOD

3.1 Workflow of the Propsoed Method

Fig. 1 shows the workflow of the proposed hu-

man detection method for visual surveillance based

on the CNN with unified detection and foreground

(7)

contextual information. How it works will be ex- plained in later sections.

3.2 Extraction of Foeground Iformation using AGMM- based Bckground Sbtraction

In visual surveillance applications, detection of foreground objects, especially human, is very de- manding. Under a fixed surveillance camera, the information about foreground object regions can be estimated from background subtraction based on background modeling. The information about fore- ground object region can serve intensified in- formation about location and confidence in detect- ing humans. Thus, in this proposed method, the in- formation about foreground rectangular regions extracted from the adopted AGMM is additionally supplied to the CNN network with unified detection together with image pixel data so that the CNN network is trained to learn location and confidence of human objects more distinctively.

To extract foreground rectangular region from an image frame image, we adopt AGMM-based background subtraction method [9, 21]. Fig. 2 shows the steps of our adopted AGMM-based ex- traction of foreground rectangular regions and as- sociated images in each step.

Fig. 2. Steps of the adopted AGMM-based foreground mask extraction; (A) original color omni-Image, (B) gray-level resized image, (C) foreground masks, (D) corrected and bounding foreground rectangular region.

3.3 Network Design

Network of the CNN with unified detection in the proposed method is designed to learn human detection more distinctively by providing fore- ground rectangular region information in addition to gray-level image data into the network as inputs.

The CNN network in the proposed method is in- spired by YOLO Tiny model [30] and modified for human detection task. While YOLO Tiny model is constructed for detection of 20 class objects, hu- man detection is just a one class object detection problem so that one can simplify YOLO Tiny mod- el more. The CNN network for human detection proposed in this paper is constructed to consist of 5 convolutional layers, 4 max pooling layers fol- lowed by 2 fully connected layers, dropout layer and output detection layer. The final output de- tection lay of the proposed network is the 7 × 7 × 11 = 539 tensors of predictions. 7 × 7 comes from 7 × 7 number of grid cells and 11 is calculated by (5*B+C) (B ; number of bounding boxes, C ; num- ber of classes) where we choose B=2 to detect overlapped or closely located humans and C=1 since human detection is 1 class object detection.

Fig. 3 shows the proposed CNN network archi- tecture for human detection.

Compared to YOLO Tiny model (9 convolutional layers, 6 max pooling layers, 3 fully connected lay- ers, 1 dropout layer, 1 output detection layer) [30], the proposed CNN obviously has reduced layers, Fig. 1. The workflow of the proposed human detection

method for visual surveillance.

(8)

but more importantly far less numbers of filters in convolutional layers since human detection needs to detect only one class (human) as opposed to YOLO model constructed to classify 20 classes.

3.4 Training

Training of the proposed CNN with unified de- tection is the same as that of YOLO model [20]

except that training input data consists of fore- ground context information whether pixels belong to foreground (1) or background (0) in addition to one channel image pixel gray-level value. Original YOLO model’s training input data consists of 3 color channel (R, G, B) values.

In the same way as YOLO model, the proposed network updates the parameters of neural network during training by back propagation of the squared error between the network output and the ground truth information (x, y, w, h, confidence score).

Pre-stage AGMM can generate false positives (wrong foreground mask regions) as well as true negatives (missing foreground mask regions) in addition to true positives (true foreground mask regions). The wrong foreground context informa- tion as well as the true foreground information (true positive) are all employed to intensify the proposed network to learn to accept true object information and to learn to filter out wrong object information. These additional foreground con- textual information contributes to improvement of

detection accuracy. Fig. 4 illustrates an example of filtering out false positives after learning (training).

Fig. 4. The proposed network can learn to filter out the wrongly extracted as a foreground mask.

4. EXPERIMENTAL RESULTS

4.1 Experimental Environments

In order to evaluate our proposed method, we used 2 kinds of omni-directional image dataset:

a well-known data set, Bomni-DB [31] and a home-grown dataset.

Bomni-DB consists of videos recorded in a room with two omni-directional cameras where the bounding boxes and actions of people are annotated.

Each image frame of Bomni-DB has resolution of 640×480. Omni-directional cameras are located at the top and side of the room. There are two differ- ent scenarios, single-person and three-people recorded. Single-person and three-people in Bomni- DB means that clearly appearing person is one or three. For experiments of this paper, single-person Bomni-DB and three-people taken from the top omnidirectional camera are adopted, and in the ex- periments of the paper, each is denoted as Bomni- 1-DB and Bomni-3-DB, respectively.

Fig. 5 and Fig. 6 show some sample images from Bomni-1-DB and Bomni-3-DB, respectively.

Fig. 3. The network architecture of the CNN with unified detection of the proposed human detection method.

Fig. 5. Bomni-1-DB.

(9)

Home-grown dataset cosists of omni-direc- tional video of 704 × 576 taken form the top at a company in Seoul, Korea. This home-grown DB will be denoted as HG-DB in this paper. Fig. 7 shows some images from the home-grown DB.

PC used for experiments has the following specification: Intel Core i5 6600 3.3GHz, 16GB RAM, and Nvidia Titan II graphic card. The graphic card Titan II is used only for training, and not used for testing (evaluation). OS for the testing PC is Windows 10 64bit.

Fig. 7. HG-DB (Home-grown Dataset).

4.2 Evaluation Methodology

In object detection, performance is commonly evaluated by accuracy and processing speed. The processing speed is simply measured by counting how many frames are processed in a sec or by cal- culating total consumed time in processing whole dataset. For accuracy evaluation, many methods adopt different measures even though they are similar in spirits. In this paper, accuracy is eval- uated by precision and recall. Calculation of pre- cision and recall requires the clear definition of measure for the correct detection. Our employed measure for the correct detection is PASCAL measure [4], which determines correct object de- tection if the area of overlap between the detect

bound box ( ^



) and the ground truth bounding box ( ^



) exceed 50% described as (1) follows.

 

_

∪

_





_

∩

_



  (1)

Now, if we denote the number of correctly de- tected human objects (true positives), the number of falsely detected human objects (false positives), and the number of missed human objects (false negatives) as TP, FP, and FN, respectively, pre- cision, recall and overall are defined as follows.

Pr     

 (2)

     

 (3)

Overall   Pr  

Pr

 (4)

Actually, Precision = 1 - False Alarm Rate, and Recall = 1 - Miss Rate. Overall acts as a a single measurement for comparison of different methods.

Because AGMM provides foreground contextual information to the proposed CNN network, the ac- curacy performance of the proposed method is af- fected by performance of AGMM. How much pre- cisely AGMM produces foreground information depends on AGMM model parameters of learning rate ^ and background threshold ^



. In stead of testing for various values of learning rate ^ and background threshold ^



, which can make experi- ments a little complicate, we rather decide to use the values of two parameters for best performing AGMM over training videos from Bomni-1-DB.

By this way, we adopt the learning rate ^{  } , and the background threshold ^



  for AGMM of the experiments in this paper.

4.3 Experimental Results

Human detection in embedded systems needs a

light and fast algorithm since embedded systems

have limited computing capability including less

powerful GPU. Even though YOLO model-based

human detection method works much fast com-

pared to other CNN-based state-of-the-art object

Fig. 6. Bomni-3-DB.

(10)

detection method, YOLO model-based method has drawback with respect to accuracy. One of the ma- jor purpose of this paper is to develop an accurate human detection method which can operate reliably in real-time under normal embedded environments.

Through experiments in this paper, we evaluate the proposed methods against YOLO (Tiny) mod- el-based human detection methods with respect to accuracy and processing speed.

Compared human detection methods are as fol- lows; YOLO Tiny Model-based human detection method, YOLO Simplified Model-based human de- tection method, AGMM-based human detection method. YOLO Tiny Model-based human detection method uses the same network architecture and parameters as in [30], YOLO Simplified Model- based human detection method uses the same net- work architecture as Fig. 3 but with only RGB pixel data input. AGMM-based human detection method decides the minimum bound boxes of fore- ground masks as human object region. AGMM- based human detection method just estimates fore- ground object regions, but it cannot precisely iden- tify foreground objects. The evaluated proposed human detection methods are two and they are de- noted as Proposed Method_1, and Proposed Method_2. Proposed Method_1 is the human de- tection method which uses the network archi- tecture of Fig. 3 and applies both grey-level image pixel data and foreground information from AGMM process to the network as inputs. Proposed Method_2 is the same as Proposed Method_1 ex- cept that the last convolutional layer of Fig. 3 is eliminated. Consideration of eliminating the last convolutional layer in the Proposed Method_ 2 comes from visualization analysis about what each convolutional layers learn. After convolutional lay- ers are visualized, it is observed that the last two convolutional layers of the Proposed Method_1 have learned almost same features. Thus, one can cut off the last convolutional layer but achieve the similar accuracy (precision and recall) with speed

up. Fig. 8 shows visualizations of the weights of convolutional layers in Proposed Method_1's network.

For accuracy evaluation, testing robustness of the methods against different environments from training environments is important. For this, we first train the proposed method and YOLO mod- el-based human detection method using sin- gle-person Bomni-DB (Bomni-1-DB), and then evaluate precision and recall of several both meth- ods against single-person Bomni-DB (Bomni-1- DB), three-people Bomni-DB (Bomni -3-DB) and the home-grown DB (HG-DB). We also evaluate processing speed of the proposed methods and compare them to those of YOLO model-based hu- man detection methods.

For training of methods (proposed ones and YOLO model-based ones), we extracted over 2000 images from all video in Bomni-1-DB (5 videos) and used this images for training process.

4.3.1 Evaluation with respect to accuracy 1) Testing under the same environments as

training dataset

We first evaluated accuracy over 2 videos (Bomni-1-DB_Video1, Bomni-1-DB_Video2) from single-person Bomni-DB (Bomni-1-DB), with the same background as training dataset. Bomni-1- DB_Video1 and Bomni-1-DB_Video2 consist of 1001 frames and 794 frames, and contain 906 per- sons and 690 persons, respectively.

Table 1 shows the experimental results about evaluation of methods over Bomni-1-DB_Video1 and Bomni-1-DB_Video2. The experimental data in Table 1 shows that both YOLO Model-based methods and the proposed methods can work well Fig. 8. Visualization of Convolutional layer in Proposed

Method_1 ’s network.

(11)

with images with the same environments as that of the trained images.

Original YOLO Tiny Model-based human de- tection method which has the network with more layers and more filters performs better than YOLO Simplified Model-based one of less layers and less filters. However, it costs computationally very

much compared to YOLO Simplified Model-based human detection method as shown in Table 4.

Fig. 9 shows some examples of the experiment of Table 1.

From Fig. 9-(2), one can see that YOLO Model (Simplified) does not learn enough objects (humans) from the trained data so that it may not detect a new object (human) with data different from trained data, but the proposed method can learn better from additional foreground context infor- mation. The image data of Fig. 9-(3) is very diffi- cult in the sense that all objects (humans) cannot be clearly differentiated from parts of background.

Thus, AGMM cannot generate useful foreground information. Thus, the proposed method also fails to detect humans in scene.

The experimental data about AGMM Only Model-based human detection in Table 1 shows it misses to detect human or it often detects human falsely. Fig. 10 shows the case where AGMM gen- erates false alarm and misses to detect humans but the proposed method can filter out the false alarm Table 1. Comparison with respect to accuracy over Bomni-1-DB which training data set is collected from

Method Name

Bomni-1-DB_Video1

(906 persons/1001 frames) Bomni-1-DB_Video2 (690 persons/794 frames) Precision Recall Overall Precision Recall Overall

YOLO Tiny Model 93.58 98.99 96.21 90.34 98.66 94.32

YOLO Simplified Model 91.49 98.67 94.94 88.02 97.83 92.67

AGMM 67.87 69.52 68.69 66.75 68.55 67.64

Proposed Method_1 97.71 99.01 98.36 98.01 99.71 98.85

Proposed Method_2 96.95 98.59 97.76 97.89 98.99 98.43

Fig. 9. The human detection results in Bomni-1-DB; (1) Both YOLO Model(Simplified) method and Pro- posed method (Method_1) detect successfully, (2) Proposed method (Method_1) works well but the YOLO Model (Simplified) method fails to de- tect all humans, (3) Both fails.

Table 2. Comparison with respect to accuracy over Bomni-3-DB which has the same background scene as training dataset but 3 people appearance

Method Name

Bomni-3-DB_Video1

(1104 persons/557 frames) Bomni-3-DB_Video2 (1025 persons/556 frames) Precision Recall Overall Precision Recall Overall

YOLO Tiny Model 52.95 56.35 54.60 59.38 64.75 61.95

YOLO Simplified Model 50.92 53.35 52.11 57.42 63.02 60.09

AGMM 64.85 66.53 65.68 64.72 67.56 66.11

Proposed Method_1 86.73 94.29 90.35 89.60 94.66 92.06

Proposed Method_2 85.89 93.53 89.55 88.95 93.98 91.40

(12)

but detect missing humans since it can learn to de- tect missing and to filter out false alarm.

2) Testing about robustness to different envi- ronments from training environments Object detection method (algorithm) needs to work reliably well under various environments. In order to evaluate reliable accuracy, we did two ex- periments; 1) one under the same background scene as that of training set but with multiple peo- ple appearance, 2) one with the different back- ground scene from that of training data set.

The experimental data of Table 2 was obtained from evaluation of accuracy over 2 testing videos (Bomni-3-DB_Video1, Bomni-3-DB_Video2) from three-people Bomni-DB (Bomni-3-DB), which has the same background scene as the training data set but 3 people appearance. Fig. 11 shows some ex-

amples of the experiment of Table 2.

Table 3 shows the experimental data with re- spect to accuracy against home-grown DB (HG- DB) videos with different environments from training data set. The experimental data in Table 3 shows that the proposed methods are even robust to background scene change.

Fig. 12 shows some examples of the experiment of Table 3.

Experimental data in Table 2 and Table 3, together with Fig. 11 and Fig. 12 imply that the proposed method performs more robust to environment changes better than YOLO model does since it can learn better object regions from foreground context information.

Fig. 10. The examples to show that the proposed meth- od works well but the AGMM only method can- not work.

Table 3. Comparison with respect to accuracy over home-grown DB with different background scene from training dataset

Method Name

HG-DB_Video1

(870 persons/870 frames) HG-DB_Video2

(1384 persons/820 frames) Precision Recall Overall Precision Recall Overall

YOLO Tiny Model 51.72 53.85 52.76 50.48 54.89 52.59

YOLO Simplified Model 47.01 51.41 49.11 47.11 51.59 49.25

AGMM 63.06 65.12 64.07 62.15 64.71 63.40

Proposed Method_1 84.81 87.41 86.09 85.18 88.05 86.59

Proposed Method_2 83.95 86.85 85.38 84.88 87.95 86.39

Fig. 11. The human detection results in Bomni-3-DB;

(1) Both YOLO Model(Simplified) method and

Proposed method (Method_1) detect success-

fully, (2) Proposed method (Method_1) works

well but the YOLO Model (Simplified) method

fails to detect all humans, (3) Both fails.

(13)

4.3.2 Evaluation with respect to processing speed We also measured processing speed with all testing videos. Table 4 shows the experimental re- sults about processing speed.

The experimental data in Table 4 shows that the proposed method is faster than YOLO Model methods. This speed up mainly results from re- duced input data dimension from 24 bits ( 3 R-G-B channels of input color image ) to 9 bits ( 8 bits data of input grey-level image and 1 bit foreground information.

5. CONCLUSIONS

In this paper, we presented a real-time human detection method under omni-directional camera for visual surveillance purpose. The proposed method is based on CNN with unified detection with additional foreground information extracted from AGMM-based background subtraction for embedded surveillance system. Since the proposed method learns more about foreground context with less input data dimension, it can achieve better ro- bust accuracy compared to YOLO model-based method [20] and it also shows faster processing speed compared to CNN-based fastest object de- tection method, YOLO model-based one.

Currently, we are porting the proposed method into a smart IP network camera, which will be reported.

REFERENCES

[ 1 ] N. Dalal and B. Triggs, “Histograms of Oriented Gradients for Human Detection,”

P roceeding of IEEE Conference on Compu- ter Vision and Pattern Recognition, pp. 886- 893, 2005.

[ 2 ] P.F. Felzenszwalb, R.B. Girshick, D. Mc Table 4. Comparison with respect to processing speed

Videos

YOLO Tiny

Model YOLO Simplified

Model AGMM Proposed

Method_1 Proposed Method_2 Time

(s) Speed

(fps) Time

(s) Speed

(fps) Time

(s) Speed

(fps) Time

(s) Speed

(fps) Time

(s) Speed (fps) Bomni-1-DB_

Video1 662.91 1.51 6.11 163 0.89 1124 5.89 169 4.65 215

Bomni-1-DB_

Video2 552.37 1.52 4.91 161 0.68 1167 4.67 170 3.71 214

Bomni-3-DB_

Video1 359.35 1.55 3.42 162 0.48 1160 3.29 169 2.61 213

Bomni-3-DB_

Video2 361.04 1.54 3.37 164 0.48 1158 3.25 171 2.59 214

HG-DB_Video1 591.60 1.47 5.35 162 0.76 1144 5.11 170 4.05 215

HG-DB_Video2 565.80 1.45 5.05 162 0.72 1138 4.85 169 3.83 214

Fig. 12. The results in HG-DB; (1) Both YOLO Model

method and the proposed method detect suc-

cessfully, (2) The proposed method works well

but the YOLO Model cannot work, (3) Both

methods cannot detect human.

(14)

Allester, and D. Ramanan, “Object Detection with Discriminatively Trained Part-based Models,” IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 32, Issue 9, pp. 1627-1645, 2009.

[ 3 ] P. Dollar, S. Belongie, and P. Perona, “The Fastest Pedestrian Detector in the West,”

P roceeding of The British Machine Vision Conference, pp. 68.1- 68.11, 2010.

[ 4 ] P. Dollar, C. Wojek, B. Schiele, and P. Perona,

“Pedestrian Detection: An Evaluation of the State of the Art,” IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 34, Issue 4, pp. 743-761, 2011.

[ 5 ] R. Benenson, M. Mathias, R. Timofte, and L.

Van Gool, “Pedestrian Detection at 100 Frames per Second,” P roceeding of IEEE Conference on Computer Vision and Pattern Recognition, pp. 2903-2910, 2012.

[ 6 ] R. Benenson, M. Omran, J. Hosang, and B.

Schiele, “Ten Years of Pedestrian Detection, What Have We Learned?,” P roceeding of European Conference on Computer Vision, pp. 613-627, 2014.

[ 7 ] B. Hariharan, C.L. Zitnick, and P. Dollár,

“Detecting Objects using Deformation Dic- tionaries,” Proceeding of IEEE Conference on Computer Vision and Pattern Recognition, pp. 1995-2002, 2014.

[ 8 ] K.M Bhuvanarjun and T.C. Mahalingesh,

“Pedestrian Detection in a Video Sequence using HOG and Covaraince Method,” Interna- tional J ournal of Electrical and Electronics Engineers, Vol. 7, Issue 1, pp. 183-190, 2015.

[ 9 ] T.B. Nguyen, V.T. Nguyen, and S.T. Chung,

“A Real-time Pedestrian Detection Based on AGMM and HOG for Embedded Surveil- lance,” Journal of Korea Multimedia Society, Vol. 18, No. 11, pp. 1289-1301, 2015.

[10] I. Cinaroglu and Y. Bastanlar, “A Direct Approach for Object Detection with Catadiop- tric Omnidirectional Cameras,” J ournal of

Signal, Image and Video Processing, Vol. 10, Issue 2, pp 413-420, 2016.

[11] M. Saito, K. Kitaguchi, G. Kimura, and M.

Hashimoto “Human Detection from Fish-eye Image by Bayesian Combination of Probabil- istic Appearance Models,” Proceeding of 2010 IEEE International Conference on Systems Man and Cybernetics, pp. 243-248, 2010.

[12] P. Sermanet, K. Kavukcuoglu, S. Chintala, and Y. LeCun, “Pedestrian Detection with Unsu- pervised Multi-stage Feature Learning,”

Proceeding of the 2013 IEEE Conference on Computer Vision and P attern Recognition.

pp. 3626-3633, 2013.

[13] W. Ouyang and X. Wang, “Joint Deep Learning for Pedestrian Detection,” IEEE International Conference on Computer Vision, pp. 2056- 2063, 2013.

[14] Y. Tian, P. Luo, X. Wang, and X. Tang,

“Pedestrian Detection Aided by Deep Learning Semantic Tasks,” IEEE Conference on Com- puter Vision and Pattern Recognition, pp.

5079-5087, 2015.

[15] F. Liu, Y. Huang, W. Yang, and C. Sun,

“High-level Spatial Modeling in Convolutional Neural Network with Application to Pedestri- an Detection,” Proceeding of IEEE 28th Canadian Conference on Electrical and Com- puter Engineering, pp.778-783, 2015.

[16] R. Girshick, J. Donahue, T. Darrell, and J.

Malik, “Region-based Convolutional Net- works for Accurate Object Detection and Segmentation,” IEEE Transactions on P at- tern Analysis and Machine Intelligence, Vol.

38, Issue 1, pp.142-158, 2015.

[17] R. Girshick, “Fast R-CNN,” P roceeding of IEEE International Conference on Computer Vision, pp. 1440-1448, 2015.

[18] D. Erhan, C. Szegedy, A. Toshev, and D.

Anguelov, “Scalable Object Detection Using

Deep Neural Networks,” Proceeding of the

2014 IEEE Conference on Computer Vision

(15)

and Pattern Recognition, pp. 2155-2162, 2014.

[19] R-cnn minus R, http://arxiv.org/abs /1506.06981, Aug. 31th, 2016.

[20] You Only Look Once: Unified, Real-Time Object Detection, http://arxiv.org/abs/1506.

02640, Aug. 31th, 2016.

[21] C. Stauffer and C, W.E.L Grimson, “Adaptive Background Mixture Models for Real-Time Tracking,” P roceeding of Conference on Computer Vision and Pattern Recognition, pp. 246-252, 1999.

[22] Recent Advances in Convolutional Neural Networks, https://arxiv.org/abs/1512.07108, Aug. 31th, 2016.

[23] K. Fukushima, “Neocognitron: A Self-organ- izing Neural Network Model for a Mechanism of Pattern Recognition Unaffected by Shift in Position,” Biological Cybernetics, Vol. 36, Issue 4, pp 193-202, 1980.

[24] B. Boser, L. Cun, J.S. Denker, D. Henderson, R.E. Howard, W. Hubbard, and L.D. Jackel,

“Handwritten Digit Recognition with a Backpropagation Network,” P roceeding of Advances in Neural Information Processing Systems, pp. 396-404, 1990.

[25] A. Krizhevsky, I. Sutskever, and G.E. Hinton,

“Imagenet Classification with Deep Convolu- tional Neural Networks,” P roceeding of

Advances in Neural Information Processing Systems, pp. 1106-1114, 2012.

[26] Y. Gong, L. Wang, R. Guo, and S. Lazebnik,

“Multi-scale Orderless Pooling of Deep Convolutional Activation Features,” Pro- ceeding of European Conference Computer Vision, pp.1-17, 2014.

[27] M. So, D.K. Han, S.K. Kang, Y.U. Kim, and S.T. Jung, “Recognition of Fainting Motion from Fish-eye Lens Camera Images,” Pro- ceeding of The 23rd International Technical Conference on Circuits/Systems, Computers and Communication, pp. 1205-1208, 2008.

[28] S.W. Jeng and W.H. Tsai, “Using Pano-map- ping Tables for Unwarping of Omni-images into Panoramic and Perspective-view Images,”

Proceeding of IET Image P rocessing, pp.

149-155, 2007.

[29] H. Asanuma, K. Okamoto, and K. Kawamoto,

“Feature Learning Based Human Detection for Omnidirectional Images,” Journal of Japan Society for Fuzzy Theory and Intelligent Informatics, Vol. 27, pp. 813-824, 2015.

[30] YOLO Project, http://pjreddie.com/darknet/

yolo/, (accessed Aug., 10, 2016).

[31] Bomni-DB Homepage, http://www.cmpe.boun.

edu.tr/pilab/pilabfiles/databases/bomni/,

(accessed Aug., 10, 2016).

(16)

Thanh Binh Nguyen He received the B. Eng. degree in computer science from the University of Science, Ho Chi Minh, Vietnam, in 2005, the M.

Sc. degree in information and telecommunication engineering from the University of SoongSil, Seoul, South Korea, in 2010, and pursuing the Ph.D.

program in engineering at the University of SoongSil, Seoul, South Korea, from 2012. He is currently a prin- cipal software R&D researcher at Embedded Vision Inc., Seoul, South Korea..assistant at Embedded Real- time Computing Lab., University of Soongsil, Seoul, South Korea. His research interests cover the design and analysis of various smart embedded software sys- tem, I.O.T and also intelligent image, video analytic algorithms which is applied to visual surveillance, rec- ognition systems, and etc.

Van Tuan Nguyen

He received the Engineer degree in electronics and computer en gineering from the Hanoi Uni versity of Science and Technol- ogy, Hanoi, Vietnam, in 2014. He is currently a research assistant at Embedded Real-time Com- puting Laboratory, Soongsil University, Seoul, South Korea. His main areas of research interest are em- bedded systems, image processing, visual surveillance, recognition systems, deep learning.

Sun-Tae Chung

He received B.E. degree from Seoul National University, and M.S. degree and Ph.D. degree in Electrical Eng. and Computer Science from the University of Michigan, Ann Arbor, USA, in 1986 and 1990, respectively. Since 1991, he had been with the School of Electronic Eng.

at the Soongsil university, Seoul, Korea where he is now a full professor. Now, he has been with the Dept.

of Smart Systems Software, at the Soongsil Univ.

since 2015. His research interests include: computer vision, visual surveillance, biometrics, and embedded systems.

Seongwon Cho

He received B.E. degree from Seoul National University, and MS. degree and Ph.D. degree in Electrical and Computer Engin- eering from the Purdue Univer- sity, Indiana, USA in 1987 and 1992, respectively. He is a pro- fessor of the School of Electronic and Electrical Engineering at the Hongik University, Seoul, Korea.

His research interests include image processing, rec-

ognition, and intelligent systems.

Real-time Human Detection under Omni-dir ectional Camera based on CNN with Unified Detection and AGMM for Visual Surveillance

1. INTRODUCTION

Since omni-directional cameras can provide a very wide field of view (up to 360° horizontally, 180° vertically) about scenes, they are popularly deployed in application areas like visual surveil- lance, where human detection is usefully demanded.

Human detection has a wide range of applications such as pedestrian detection for car driving safety, people counting for commercial marketing, human interaction in robotics, and so on in addition to hu- man intrusion detection in visual surveillance.

Thus, during the last two decades, huge intensive

research efforts have been poured into image- based human detection.

As opposed to researches on human detection under perspective cameras, where remarkable ach- ievements [1-9] have been accomplished during the last decade, those under omni-directional (fisheye) cameras [10-11] is far from satisfaction.

Exquisite human model like deformable part model [2] and deformation dictionaries [7] as well as most of effective hand-crafted feature vectors for human detection under perspective images like HoG [1, 8, 9] and Integral channel [3] cannot be directly applied to omni-images since they are se-

Real-time Human Detection under Omni-directional Camera based on CNN with Unified Detection

and AGMM for Visual Surveillance

Thanh Binh Nguyen

, Van Tuan Nguyen

, Sun-Tae Chung

, Seongwon Cho

ABSTRACT

Key words: Human Detection; Omni-Directional Camera; CNN (Convolutional Neural Networks);

AGMM (Adaptive Gaussian Mixture Model); Visual Surveillance

※ Corresponding Author : Sun-Tae Chung, Address:

(06978) Dept. of Smart Systems Software, Soongsil Univ., 369, Sangdo-Ro, Dongjak-Gu, Seoul, Korea, TEL : +82- 2-820-0638, FAX : +82-2-821-7653, E-mail : cst@ssu.

ac.kr

Receipt date : May 9, 2016, Approval date : Aug. 9, 2016

Dept. of Information and Telecommunication Engin- eering, Soongsil University

(E-mail : [email protected])

Dept. of Smart Systems Software, Soongsil Uniersity

School of Electronic and Electrical Engineering, Hongik University (E-mail : [email protected])

※ This work was supported by the Soongsil University

Research Fund

verely distorted from perspective images.

In this paper, we concentrate on developing fast but accurate human detection under omni-direc-

The foreground contextual information provided

by the pre-stage AGMM process contributes to

training CNN network to learn object (human)

areas more confidently and more precisely so that

in testing, the trained CNN network performs more

accurate with respect to human detection. Also, the

false foreground object area information extracted

by AGMM is used to train the CNN network to

learn about the wrong foreground areas. Thus, in

testing, the trained CNN network will work toward

filtering out wrong human detection (false pos-

itive). Moreover, the contextual information about

foreground object areas turns out to train the CNN

network so as to be more robust to environment changes including background scene changes.

The experimental results show that the pro- posed human detection method significantly im- proves accuracy and robustness to background changes without deterioration of processing speed

compared to YOLO model-based human detection [20], which makes the proposed human detection method suitable for real-time human detection in embedded surveillance systems.

In this proposed method, it is shown that what data to train for CNN is also as important as how to train data for CNN.

The rest of the paper is organized as follows.

Section 2 introduces technical backgrounds and re- lated work necessary for understanding the works of this paper. Section 3 describes our proposed hu- man detection method. Experimental results are discussed in Section 4, and finally the conclusion is presented in Section 5.

2. TECHNICAL BACKGROUNDS AND RELATED WORK

2.1 AGMM (Adaptive Gaussian Mixture Model)- based Background Subtraction

The rationale in “background subtraction” for

detecting foreground objects in videos from static

cameras (the background scene of camera is fixed)

is to detect the foreground objects from the differ-

ence between the current frame and a reference

frame, often called “background model”. AGMM

(Adaptive Gaussian Mixture Model) [21] is a stat-

istical background modeling which is well-known to be effective for extracting foreground objects in complex background scenes. For the detailed AGMM-based foreground object extraction algo- rithm, readers are recommended to refer to [21].

Background subtraction methods based on AGMM contains two significant parameters, the learning rate constant  in the Gaussian statistics update and the background threshold 

Although the basic concepts of CNN are known since 1980 [23], they did not receive a lot of atten-

tion until the last few years [24-25]. Due to theo- retical advances, the increased availability of com- putational resources and larger amounts of data and excellent performance, CNN-based approaches are now considered state-of-the-art in many vi- sion tasks including human detection.

or to exploit the spatial relation of semantic fea- tures by spatially weighted max pooling [15].

2.3 CNN with Unified Detection [20]

The width, w and height, h are predicted relative

to the whole image. Each grid cell also predicts C

conditional class probabilities. The YOLO model

only predicts one set of class probabilities per grid

cell, regardless of the number of boxes B. Thus,

these predictions are encoded as an S × S × (B

× 5 + C) tensors where S × S is the size of selected grid, B is number of bounding box for each cell, C is the number of object classes.

About how the network of YOLO model is con- structed and how it is trained, readers are recom- mend to refer to [20].

2.4 Related Work

[15] proposes a new module to learn spatially weighted max pooling module in CNN to jointly model the spatial structure and appearance of high-level semantic features, and applies the new CNN model to pedestrian detection for evaluation.

Experiments through several pedestrian detection databases shows that it performs better than tradi- tional CNN.

apply CNN only over region proposals extracted from the pre-stage processing, but not over the whole image frame.

All these prior works [12-17] formulate object

detection as applying classifiers to each image re-

gion whether it is a sliding window or a selective region so that they still have to apply the heavy computing CNN many times before finishing.

However, all these prior approaches are still heavy for real-time processing for embedded systems.

Due to large field of view, omni-directional lens cameras are popularly used for various applica- tions, such as visual surveillance. For surveillance purpose, human detection under omni-directional cameras have been actively studied during past

Background subtraction methods based on AGMM contains two significant parameters, the learning rate constant ^ in the Gaussian statistics update and the background threshold ^

bound box ( ^

) and the ground truth bounding box ( ^