Keshav Shenoy
Table of Contents
Rationale of Study ……………………………………………………………………………… 3
Concept Map …………………………………………………………………………………… 4
D. Definition of Terms ………………………………………………………………………… 7
E. Assumptions ………………………………………………………………………………… 8
G. Importance of Study ……………………………………………………………………… 10
Chapter 3: Methodology ……………………………………………………………………… 15
E. Data Tables ………………………………………………………………………………… 20
References ……………………………………………………………………………………… 21
USING CAPSNETS FOR TRAFFIC LIGHT IMAGE RECOGNITION 3
Rationale of Study
Autonomous driving is a growing field of study with large numbers of potential commercial
applications. Many current implementations use Convolutional Neural Networks (CNNs) to recognize traffic light signals, but CNNs have
significant flaws, most notably in their ability to evaluate positional data. As a result, this
research will investigate the possible benefits of implementing Capsule Neural Networks
(CapsNets) in place of CNNs. Specifically, the research will look for improvements in final
accuracy with faster minimization of loss. The basis for this hope can be found in the work of
Kumar, Arthika, and Parameswaran (2018), who implemented CapsNets in traffic sign
classification with positive results and a 97.6% accuracy (p. 4546). CNNs have been the leading
edge of image recognition for a long time and, as such, an alternative offering significant
improvements could reshape the development of autonomous vehicles. The researchers wish to
observe the benefits and the drawbacks of CapsNet architecture relative to that of CNNs. The
comparison is multifaceted and could motivate further research.
a. Overarching Question
How can CapsNets be used to improve traffic light image recognition applications for
autonomous driving?
This study will determine the potential of CapsNets to improve the final accuracy of traffic light
image recognition. This necessitates that the CapsNet can minimize loss at a faster rate than
current CNNs.
Currently, according to Hinton (2018), CNNs are the predominant machine learning technique
being used for detecting and identifying objects within images (7:00). This has led to their
inclusion within multiple language libraries like Keras and TensorFlow. Unfortunately, like any
other emerging technology, there are a number of flaws with CNNs. Hinton (2018) has proposed
that many of these flaws can be remedied through the alteration of CNNs into a new, similar
neural network structure called a CapsNet (3:09). This research applies Hinton’s assertion to the
field of autonomous driving, where CNNs are used to assist autonomous motor vehicles in
detecting traffic light signals.
Foundation Sub-problem 1: What image recognition applications are currently the most utilized
for autonomous driving?
Hypothesis: The most utilized current image recognition applications are CNN models, which
dominate the detection and identification of objects within images.
Foundation Sub-problem 2: What is the benefit of Capsule Neural Networks over current image
recognition applications?
Hypothesis: CapsNets will have more accurate results because of positional data preservation.
Applied Sub-problem 1: How do CapsNets perform when implemented in traffic light image
recognition?
Hypothesis: A trained CapsNet will recognize traffic lights from multiple autonomous driving
related image datasets with more than 85% final validation accuracy.
Independent Variables: The model, build, and design of the machine learning (ML) system that
is implemented.
Dependent Variable: Final accuracy of CapsNet operating on validation data after training.
D. Definition of Terms
a. Terms
− Artificial Intelligence: “…that activity devoted to making machines intelligent…” (Nilsson, as cited in Stone et al., 2016, p. 12)
− Artificial Neural Network: A machine learning system composed of interconnected
“neurons,” loosely based on the organization of certain neurons in human brains (Rawat & Wang, 2017, p. 2354).
− Convolutional Neural Network: An artificial neural network built
from layers of convolutional and pooling layers (Rawat & Wang, 2017, p. 2354).
− Capsule Neural Network: A type of artificial neural network that modifies convolutional
neural networks by segmenting groups of neurons into capsules for the better evaluation
of positional data.
− Image Recognition (or Image Classification): “…the task of categorizing images into one
of several predefined classes…”
− Convolutional Layers: “…serve as feature extractors, and thus they learn the feature
representations of their input images” (Rawat & Wang, 2017).
− Machine Learning: “…the design of learning algorithms, as well as scaling existing
algorithms, to work with extremely large data sets.” (Stone et al., 2016, p. 9)
− Pooling Layer: LeCun et al. (1989a), LeCun et al. (1989b), LeCun et al. (1998), and
Ranzato et al. (2007) claimed that pooling layers “…reduce the spatial resolution of the
feature maps and thus achieve spatial invariance to input distortions and translations” (as
cited in Rawat & Wang, 2017).
− Pose: Information about an entity’s position, orientation, scale, deformation, velocity,
color, and more, which is recorded by CapsNets (Hinton, 2018, 3:23).
− TF: TensorFlow
E. Assumptions
It is assumed that the datasets accurately represent the population of traffic lights that
autonomous motor vehicles would encounter in practice. While traffic lights are not very
varied in design, the datasets are assumed to capture the regional differences that do exist.
It is assumed that the performance of the produced CapsNet after training accurately models the
performance it would achieve in practical operation.
It is assumed that the dataset developers annotated the datasets with the correct bounding boxes
and signal states.
It is assumed that the power of the Central Processing Unit and processors of the computer used
is sufficient to train and test the network within the project timeframe.
It is assumed that the ML algorithm can be created and tested at full potential within the TF API.
F. Limitations
There are many different types of ML algorithms and neural networks. The research will be
confined to the performance of CNNs and CapsNets due to their relevance and current use within
the field.
The research will limit itself to the study of accuracy, with the understanding that a high final
accuracy indicates the ability for optimization in terms of performance and speed on more
advanced hardware.
The research will only investigate the performance of CapsNets within the TF framework and
will not attempt to reconstruct the design within Caffe or any other ML framework.
The research will limit itself to a few levels of image quality and dimensions with the
understanding that practically applied autonomous driving applications will have similar or
higher image quality.
While planning to attempt to identify relatively small traffic lights with artificial neural
networks, the research will set a minimum pixel size at around 4px width, given the futility of
recognizing lights smaller than that.
The research is limited to the ML area of artificial intelligence and will not examine other areas
of the field.
G. Importance of Study
By showing the performance of CapsNet technology within traffic light image recognition in
autonomous driving, this research can support or fail to support a shift in resources towards
further CapsNet research. The potential for a more powerful and accurate alternative to CNNs is
very significant, because CNNs are currently at the forefront of object recognition (Hinton, 2018,
7:00). Improving upon the capabilities of CNNs with CapsNets could change how researchers
approach image recognition problems and push forward the global adoption of autonomous
motor vehicles, as well as the incorporation of artificial intelligence and ML into everyday
objects.
Currently, the field of image object detection and recognition within ML is increasing in
importance for a number of different applications. Specifically, Fairfield and Urmson (2011)
discuss its growing significance in the field of autonomous driving, where it has been used to
build perception systems in combination with cameras (p. 1). They specifically cite the issue of
traffic light image recognition, which cannot be performed by alternative measures like sonar or
radar (p. 1), because it requires knowledge of color. As such, a large amount of development has
gone into designing the best learning algorithms for traffic light image recognition problems. So
far, Huang et al. (2017) found that the leading models used are CNNs (p. 1). Lim et al. (2017)
discuss this, describing CNN architecture as one where image data is fed through a series of
deep (convolutional) and pooling layers, as well as a kernel, to extract features for classification
(p. 11). They explain that CNN technology is state-of-the-art, needing only one network to
perform both feature extraction and classification.
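As a minimal sketch of the architecture Lim et al. describe (convolutional feature extractors alternating with pooling layers, followed by a classifier), a toy Keras model might look like the following. The layer widths, input size, and three-class output are illustrative assumptions, not values from the cited work.

```python
import tensorflow as tf

def build_cnn(input_shape=(64, 64, 3), num_classes=3):
    """Toy CNN: convolution layers extract features, pooling layers
    subsample them, and a dense softmax head classifies the result.
    Sizes here are assumptions for illustration only."""
    return tf.keras.Sequential([
        tf.keras.Input(shape=input_shape),
        tf.keras.layers.Conv2D(32, 3, activation="relu"),  # feature extraction
        tf.keras.layers.MaxPooling2D(),                    # spatial subsampling
        tf.keras.layers.Conv2D(64, 3, activation="relu"),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(num_classes, activation="softmax"),
    ])

model = build_cnn()
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])
```

Note that the max pooling layers here are exactly the subsampling step that the CapsNet literature discussed below criticizes.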
Despite this, there are still significant issues with the CNN model. One significant
problem Liu et al. (2016) identify is balancing speed performance and accuracy (p. 1). To
alleviate some of this, Liu et al. (2016) suggest SSD (Single Shot MultiBox Detector) – a “deep
network based object detector that does not resample pixels or features for bounding box
hypotheses and is as accurate as approaches that do” (p. 2). By replacing bounding box
proposals with a convolutional filter, Liu et al. (2016) are able to construct a model that operates
at higher frames per second than previous approaches with Faster R-CNN (p. 16). However, in
contrast with Liu et al.’s research, Huang et al. (2017) suggests that Faster R-CNN can be more
accurate when detecting very small objects, a task SSD struggles with (p. 14). Meanwhile, in the similar field of traffic
sign image recognition, Lim et al. (2017) took a unique approach to the optimization problem by
combining a Support Vector Machine (SVM) model – an ML system which does not utilize
neural networks – with CNN technology to improve results (p. 2). SVMs were utilized first to
verify the sign and a CNN afterwards to classify the sign (Lim et al., 2017, p. 2). Lim et al.’s
(2017) combination succeeded, forming a system able to classify images in real time with
97.9% average accuracy and with improved accuracy specifically in poor lighting (p. 19). It is
difficult to compare Lim et al.’s (2017) sign model to the traffic light models of Liu et al. (2016)
or Huang et al. (2017), but the improvements of Huang et al. and Lim et al. over Liu et al. in
such a short time frame show the speed of significant advancements occurring within the field.
Outside of the actual model usage, multiple researchers have attempted to make image
recognition systems more efficient at a structural level. Prime examples of
this are Fairfield and Urmson (2011), who show the ability for mapped traffic lights to improve
detection results within a model (p. 6). By mapping the location of traffic lights against current
location of the vehicle, a network can predict when it should expect to detect traffic lights and
when it should expect not to, reducing false positives and false negatives (Fairfield & Urmson,
2011, p. 6). Ghahramani (2015) takes a more technical approach, exploring the ability for
probabilistic frameworks – models which “make predictions about future data, and take
decisions that are rational given these predictions” (p. 1) – to increase accuracy. Tyukin et al.
(2018) mirror this by considering the use of multiple ML models within a teacher-student
model, speeding up the training of classification algorithms and improving the universality of
models in application to data (p. 1). They improve on previous work in the field by creating a
framework for the teacher-student model which requires less raw data and training (Tyukin et al.,
2018, p. 2). Though not implemented within the context of automated driving, the success of the
model within CNN image recognition suggests its potential for the field.
More than anything else, however, the biggest challenge that has been issued against
CNNs is from Hinton (2018), who references their lack of structure as a major flaw in their
performance in handling positional data (1:47). As a way to fix this, Hinton (2011; 2018)
proposes CapsNets, similar to CNNs but with layers loosely replaced with “capsules” (p. 2;
3:09). According to Hinton (2018), capsules would output the likelihood that a feature is present
and “pose” information, which would include a large amount of positional information (3:09).
First, Hinton (2018) claims, capsules would improve massively on the current CNN
practice of max pooling, which reduces the available information in a subsampling procedure
(6:57). CapsNets get rid of pooling completely, instead using coincidence filtering to find
clusters of inputs at high dimensions, removing unwanted background inputs while keeping
useful data (Hinton, 2018, 5:26). Secondly, Sabour, Frosst, and Hinton (2017) point out the
benefits of capsules for the dynamic routing of information, specializing specific capsules for
certain tasks (p. 2). This contrasts with max pooling, which Sabour et al. (2017) state will
“throw away information about the precise position of the entity within the region” (p. 2);
dynamic routing instead considers multiple input vectors, not just the most active one. These two effects, the
removal of subsampling and the introduction of dynamic routing, could lead to improvements in
a number of fields, including: digit segmentation and separation, like that performed by Hinton
et al. (2000, p. 1) and Sabour et al. (2017); traffic sign image recognition, like that
done by Lim et al. (2017); and shape analysis, like that described by Hinton (2018, 15:15). In
fact, Kumar, Arthika, and Parameswaran (2018) have already implemented CapsNets in traffic
sign image recognition with strong results: 97.6% accuracy and 0.0311038 loss at the end of
validation (p. 4546). Unfortunately, it does not seem like CapsNets have yet been applied to the
primary issue of this research, traffic light recognition. Based on the results of Kumar et al.
(2018), however, the CapsNet architecture should have a strong accuracy rating when applied to
traffic light recognition.
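The vector outputs described above can be illustrated with the “squash” nonlinearity from Sabour et al. (2017), which rescales a capsule’s raw output vector so that its length behaves like a probability that a feature is present. This NumPy sketch is a minimal reproduction of that one function, not the full routing algorithm.

```python
import numpy as np

def squash(s, axis=-1, eps=1e-9):
    """Squash nonlinearity from Sabour et al. (2017):
    v = (|s|^2 / (1 + |s|^2)) * (s / |s|).
    Preserves the vector's direction (the pose) while mapping its
    length (the presence probability) into the interval [0, 1)."""
    sq_norm = np.sum(s ** 2, axis=axis, keepdims=True)
    return (sq_norm / (1.0 + sq_norm)) * s / np.sqrt(sq_norm + eps)

# A strongly activated capsule keeps its direction, length approaches 1;
# a weakly activated capsule is suppressed toward 0.
long_v = squash(np.array([10.0, 0.0]))
short_v = squash(np.array([0.1, 0.0]))
```

The direction of the output vector is what carries the positional (“pose”) information that max pooling would discard.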
Discussion
From the literature, it becomes clear that there are numerous areas for potential
improvement within CapsNets that do not exist in CNNs. These include the elimination of
information loss from down-sampling suggested by Hinton et al. (2011, p. 7) and by Hinton
(2018, 6:55), as well as within dynamic routing between capsules to enable specialization
(Sabour et al., 2017, p. 2). Sabour et al. (2017) go so far as to state that, “The fact that a simple
capsules system already gives unparalleled performance at segmenting overlapping digits is an
early indication that capsules are a direction worth exploring” (p. 9). This supports the
conclusion that CapsNets, if developed to the same level as CNNs have enjoyed, should become
the dominant architecture for image recognition.
Further Research: Beyond just this research’s exploration of the utilization of CapsNets
in traffic light imaging, research should also be conducted into applying the improvements made
within CNN architecture to CapsNets. As an example of that, the emulation of Fairfield and
Urmson’s (2011) traffic light mapping (p. 1) or Lim et al.’s (2017) utilization of SVMs as a
pre-processing measure (p. 1) within a CapsNet framework could provide valuable evidence
towards the viability of CapsNets in autonomous driving.
Chapter 3: Methodology
Applied Sub-Problem 1: How do CapsNets perform when implemented in traffic light image
recognition?
Need: A trained CapsNet will recognize traffic lights from multiple autonomous driving related
image datasets with more than 85% final validation accuracy.
Research Basis: This need provides a good basis from which to begin examination, because it
establishes clear proof of concept from which the CapsNet can improve. As detailed in the
research paper, state-of-the-art image recognition technology has reached the point of greater
than 90% accuracy after a reasonable number of iterations. Hinton (2018) describes how CNNs
have been extensively developed and improved by researchers for many years now (7:02). As
such, 85% accuracy is an ambitious, but reasonable level of accuracy to expect from an emerging
model for learning. Reaching that level supports the idea that there is potential for CapsNet
architecture to improve to the point of replacing CNN architecture in traffic light image
recognition applications in autonomous driving systems, but is not too high a bar for the newer
architecture.
Independent Variable: The model, build, and design of the machine learning (ML) system that
is implemented.
Dependent Variable: Final accuracy of CapsNet operating on validation data after training.
Type of Design: The research will utilize the Engineering Design Process to implement a
CapsNet ML system within traffic light image recognition. If the design does not meet the
evaluation criteria, another iteration will be introduced until the best CapsNet structure
achievable within the timeframe is reached. The design process assumes that the
product is possible to construct and that the successful implementations of previous researchers
will cross apply to this work, as well as the assumptions listed previously in Chapter 1.
Type of data: The data is numerical. The final accuracy is a single number taken at the end of
validation from a table of accuracy over iteration, while loss will be measured per epoch as a
residual sum of squares. Final accuracy is the number being used to evaluate the success of the
product, while loss will simply inform the researchers of how the model’s accuracy increased
over training and validation. The data is descriptive, because it is summarizing the success of the
model in classification. It also encompasses the whole scope of the network, not just a sample of
it.
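As an illustration of the two measurements described above, the following sketch computes a residual-sum-of-squares loss and a final accuracy from a made-up validation batch; the labels and scores are hypothetical values, not real model output.

```python
import numpy as np

def rss_loss(y_true, y_pred):
    """Residual sum of squares between one-hot labels and predicted scores."""
    return float(np.sum((np.asarray(y_true) - np.asarray(y_pred)) ** 2))

def accuracy(y_true, y_pred):
    """Fraction of samples whose highest-scoring class matches the label."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float(np.mean(y_true.argmax(axis=1) == y_pred.argmax(axis=1)))

# Hypothetical validation batch: 3 traffic-light classes, 4 samples.
labels = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 0, 0]])
scores = np.array([[0.9, 0.05, 0.05],
                   [0.2, 0.7, 0.1],
                   [0.1, 0.2, 0.7],
                   [0.4, 0.5, 0.1]])  # last sample misclassified

final_accuracy = accuracy(labels, scores)  # 3 of 4 correct -> 0.75
epoch_loss = rss_loss(labels, scores)
```

In the study, `epoch_loss` would be recorded once per epoch, while `final_accuracy` after the last validation pass is the single number used to judge the design against the 85% criterion.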
Testing: The model evaluation is done as part of its operation, during the validation section of
the code. This section will test the code against images it has not yet seen, but that are of the
same type as those the model was trained on. The accuracy of the model in recognizing traffic
lights in these unseen images will be recorded as the final validation accuracy.
Analysis: The only way to analyze the model’s accuracy data is by direct greater-than, less-than
comparison of accuracies to previous versions and the evaluation objective, because the network
is built entirely around minimizing loss and increasing accuracy. As such, it would not make
sense to analyze the model by any metric other than its own accuracy. A model which
reaches the threshold of 85% final accuracy is successful, while a model which does not fails
the evaluation objective.
Validity:
− Internal Validity: Internal validity will be increased by the randomization of all possible
assignments within the design process. These include the order in which examples are read
during training and validation, which subsamples of data are used for training, and which
subsamples are used for validation. It will also be improved by controlling as many variables
as possible, including the number of steps allowed within training and validation and the
dataset configuration.
− External Validity: This will be increased by trying to have as universal a coverage as possible
of traffic light images. By incorporating every type of traffic light, the model will be
applicable to almost all of the subject. This can be done by using multiple datasets, as this
research will, and by using datasets with many diverse sets of images from multiple regions
and conditions.
− Criterion-related Validity: If the results of the produced CapsNet in traffic light image
recognition are similar to results of other CapsNets produced for traffic sign image
recognition or other perception problems, it suggests that the model is operating correctly in
line with established results.
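The randomized assignments discussed under internal validity could be implemented with a seeded shuffle-and-split along these lines; the 80/20 ratio and the seed value are illustrative choices, not parameters taken from the study.

```python
import random

def shuffled_split(examples, train_fraction=0.8, seed=42):
    """Shuffle example order with a fixed seed (reproducible randomization),
    then split into training and validation subsamples."""
    rng = random.Random(seed)
    pool = list(examples)
    rng.shuffle(pool)
    cut = int(len(pool) * train_fraction)
    return pool[:cut], pool[cut:]

train, val = shuffled_split(range(100))
```

Fixing the seed is also what makes the test-retest reliability check below meaningful: two runs under identical, controlled conditions should produce the same assignments.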
Reliability:
− Test-Retest Reliability: If the model, run twice on the same dataset, produces the same
results, it suggests that the model’s measurements are reliable.
− Inter-rater Reliability: If the model, run on two different datasets, returns similar results, it
suggests that the model is operating correctly and is not overfitted to a specific set of data
samples.
Consistency:
Both forms of reliability addressed above and the factors discussed within internal validity can
be applied to measure the consistency of the model. Additionally, because the program executes
the same sequence of steps on each run, as long as testing conditions (including any random
seeds) are controlled, the program should operate in the same manner every time.
Feasibility:
Kumar et al.’s (2018) successful construction of a CapsNet model for traffic sign image
recognition (p. 4547) demonstrates the feasibility of the system within artificial intelligence and
ML.
Week 1-2: Building on Empirical Examples: The first step is to examine capsule neural
networks and convolutional neural networks previously implemented for similar problems.
By founding the most basic areas of the design from models shown to previously have success,
the research can establish the model on a stable basis from which to start the design process.
Reference implementations are available from academic researchers and industry
employees on GitHub.
Week 3: Implement Data Processing: The first step within both CNNs and CapsNet Frameworks
is the processing of the input data into a format understandable by the neural networks. This is
achieved through a Python program which reads each pixel of the images from the traditional file
format into a 3-dimensional array of pixel values. Each image will be represented by one array,
with height, width, and RGB making up the 3 dimensions. The final input dimensions would be
Height × Width × 3. The pre-annotated datasets utilized by this research have this data already
established with labels and ground truths within a JSON, config, or similar file. The program will
read the labelling and truth information from the file and send it to the neural network for the
training and validation stages.
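A minimal sketch of this processing step is shown below. The raw-byte decoding and the JSON keys (`label`, `box`) are assumptions about the dataset format, and a real pipeline would typically use NumPy and an image library rather than nested lists.

```python
import json

def decode_image(raw_bytes, height, width):
    """Turn a flat buffer of RGB bytes into a height x width x 3
    nested list of pixel values (a stand-in for a NumPy array)."""
    assert len(raw_bytes) == height * width * 3
    it = iter(raw_bytes)
    return [[[next(it) for _ in range(3)]        # R, G, B per pixel
             for _ in range(width)]
            for _ in range(height)]

def read_annotations(json_text):
    """Parse labels and ground-truth boxes from a JSON annotation string.
    The keys used here ('label', 'box') are assumed, not from a real dataset."""
    return [(a["label"], tuple(a["box"])) for a in json.loads(json_text)]

# 2x2 dummy image and one hypothetical annotation.
image = decode_image(bytes(range(12)), height=2, width=2)
annotations = read_annotations('[{"label": "green", "box": [0, 0, 2, 2]}]')
```

Each decoded image is one Height × Width × 3 structure, matching the input dimensions described above.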
Week 4-12: Implement, Train, and Test Capsule Neural Network: The second step of the design
process is to create the actual Capsule Network. This network will have the implementation in
Python for both the training and the validation portions of the Neural Network done using the TF
framework. At the end of the validation, code will be included to produce and plot an
accuracy and loss curve, as well as to record the data into a CSV or similar data file. If the
objective of 85% accuracy is not reached, the researchers will analyze further where losses in
performance could have occurred and renovate the CapsNet, iterating the design until 85%
accuracy is reached.
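Independently of the network itself, the recording and evaluation steps described above can be sketched as follows; the history dictionary mimics the shape of the object returned by Keras training, with placeholder values rather than real results.

```python
import csv, io

def history_to_csv(history):
    """Write per-epoch accuracy and loss to CSV text
    (in practice this would go to a file for later plotting)."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["epoch", "accuracy", "loss"])
    for epoch, (acc, loss) in enumerate(
            zip(history["accuracy"], history["loss"]), 1):
        writer.writerow([epoch, acc, loss])
    return buf.getvalue()

def meets_objective(history, threshold=0.85):
    """Design-iteration check: did final validation accuracy reach 85%?"""
    return history["accuracy"][-1] >= threshold

# Placeholder history, shaped like the dict tf.keras model.fit() returns.
history = {"accuracy": [0.60, 0.78, 0.88], "loss": [1.20, 0.55, 0.30]}
csv_text = history_to_csv(history)
```

If `meets_objective` returns `False`, the design loop described above repeats with a revised CapsNet.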
Tools: The entire project is done within Python, a popular machine learning language.
Additionally, the research utilizes the TF Python framework, which implements a large number
of classes, functions, and objects for ML. The TF framework provides simple, pre-implemented
methods for developing the ML algorithm, measuring the change in the dependent variable
(accuracy), and checkpointing the artificial neural network every few steps.
E. Data Tables
CapsNet Accuracy

Epochs                   1   2   3   4   5   6   7   8   9
Training Accuracy (%)
Testing Accuracy (%)
Note: The actual data table will have more than this number of epochs, depending on the
iteration amount chosen and number of data samples. The whole table is not shown for ease of
viewing.
CapsNet Loss

Epochs                   1   2   3   4   5   6   7   8   9
Training Loss (no units)
Testing Loss (no units)
Note: The actual data table will have more than this number of epochs, depending on the
iteration amount chosen and number of data samples. The whole table is not shown for ease of
viewing.
References
Fairfield, N., & Urmson, C. (2011). Traffic light mapping and detection. IEEE International
Conference on Robotics and Automation (ICRA).
Hinton, G. E., Ghahramani, Z., & Teh, Y. W. (2000). Learning to parse images. Advances in
Neural Information Processing Systems. Retrieved from the NIPS Proceedings database.
Hinton, G. E., Krizhevsky, A., & Wang, S. D. (2011). Transforming auto-encoders. Lecture
Notes in Computer Science: Artificial Neural Networks and Machine Learning – ICANN
2011, 44-51. doi:10.1007/978-3-642-21735-7_6
Hinton, G. E. (2018, April 12). What's wrong with convolutional nets? [Video file]. Retrieved
from https://techtv.mit.edu/collections/bcs/videos/30698-what-s-wrong-with-convolutional-nets
Huang, J., Rathod, V., Sun, C., Zhu, M., Korattikara, A., Fathi, A., . . . Murphy, K. (2017).
Speed/accuracy trade-offs for modern convolutional object detectors. 2017 IEEE
Conference on Computer Vision and Pattern Recognition (CVPR). doi:10.1109/cvpr.2017.351
Kumar, A. D., Arthika, R. K., & Parameswaran, L. (2018). Novel deep learning model for traffic
sign detection using capsule networks. International Journal of Pure and Applied
Mathematics.
LeCun, Y., Boser, B., Denker, J. S., Henderson, D., Howard, R. E., Hubbard, W., & Jackel, L. D.
(1989a). Backpropagation applied to handwritten zip code recognition. Neural
Computation, 1(4), 541-551.
LeCun, Y., Boser, B., Denker, J. S., Henderson, D., Howard, R. E., Hubbard, W., & Jackel, L. D.
(1989b). Handwritten digit recognition with a back-propagation network. Advances in
Neural Information Processing Systems.
LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). Gradient-based learning applied to
document recognition. Proceedings of the IEEE, 86(11), 2278-2324. doi:10.1109/5.726791
Lim, K., Hong, Y., Choi, Y., & Byun, H. (2017). Real-time traffic sign recognition based on a
general purpose GPU and deep-learning. PLoS ONE, 12(3). doi:10.1371/journal.pone.0173317
Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C., & Berg, A. C. (2016). SSD:
Single shot MultiBox detector. Computer Vision – ECCV 2016, Lecture Notes in
Computer Science.
Nilsson, N. J. (2010). The quest for artificial intelligence: A history of ideas and achievements.
Cambridge University Press.
Ranzato, M. A., Huang, F. J., Boureau, Y., & LeCun, Y. (2007). Unsupervised learning of
invariant feature hierarchies with applications to object recognition. 2007 IEEE
Conference on Computer Vision and Pattern Recognition. doi:10.1109/CVPR.2007.383157
Rawat, W., & Wang, Z. (2017). Deep convolutional neural networks for image classification: A
comprehensive review. Neural Computation, 29(9), 2352-2449. doi:10.1162/neco_a_00990
Sabour, S., Frosst, N., & Hinton, G. E. (2017). Dynamic routing between capsules. 31st
Conference on Neural Information Processing Systems. Retrieved from the NIPS
Proceedings database.
Stone, P., Brooks, R., Brynjolfsson, E., Calo, R., Etzioni, O., Hager, G., … Teller, A. (2016,
September). Artificial intelligence and life in 2030. One Hundred Year Study on
Artificial Intelligence: Report of the 2015-2016 Study Panel. Stanford University.
Tyukin, I. Y., Gorban, A. N., Sofeykov, K. I., & Romanenko, I. (2018, August 13). Knowledge
transfer between artificial intelligence systems. Frontiers in Neurorobotics.
doi:10.3389/fnbot.2018.00049