
Evaluation of Point Pair Feature Matching for Object Recognition and Pose Estimation in 3D Scenes

Martin Rudorfer and Xaver Kroischke

Technische Universität Berlin
Pascalstr. 8-9, 10587 Berlin
martin.rudorfer@iat.tu-berlin.de
www.iat.tu-berlin.de

Abstract. Object recognition and pose estimation is a key technology for contemporary and future industrial robotic applications. We implemented a point pair feature matching method and assess its suitability with respect to relevant criteria of such robotic tasks. We evaluate the algorithm on three different types of datasets to investigate the effects of object and scene properties on the recognition performance. Our implementation yields competitive results, yet we identify several drawbacks. We find that objects which are likely to be found in industrial environments, with rotational symmetries and rather primitive surface geometries, are harder to recognize than objects with more complex geometries. Also, the computational effort increases drastically when looking for small objects in large scenes. Without further modifications, the algorithm merely suggests an arbitrary number of poses for the object in question, without determining how many objects are actually present in the scene.

1 Introduction
Industrial robots are a fundamental part of today's production systems. The major operations robots are used for are welding, loading and unloading of machines, packaging and palletizing, and assembly. However, to this day, the use of robots is only profitable for large production volumes. Due to the sophisticated programming methods, reprogramming robots is generally conducted by experts and is time- and cost-intensive. This generally prevents a profitable use of industrial robots for Small and Medium-Sized Enterprises (SMEs), which commonly have smaller lot sizes and would therefore need to reprogram their robots more frequently [1].
Current research aims to make the use of industrial robots more attractive for SMEs by creating more intuitive programming methods and improving the robots' cognitive abilities. One key ability we focus on in our work is the recognition of objects to interact with. We implemented an object recognition system based on the point pair feature matching proposed by Drost et al. [2], which outperformed other feature-based state-of-the-art methods like Spin Images [3] and Tensors [4]. The main contribution of this paper is the assessment of this object recognition system for industrial robotic tasks, using three different datasets for evaluation. We aim to identify potential improvements for further work.
Section 2 defines relevant criteria for this assessment. In Section 3, we describe the point pair feature matching along with several subsequently proposed extensions. Our experiments are presented in Section 4, followed by a discussion of the results with regard to the defined criteria in Section 5. Finally, Section 6 concludes the paper with a short summary and suggestions for further research.

2 Assessment Criteria for Object Recognition Systems


In order to find suitable criteria for the assessment of an object recognition system, we first give a definition of the term object recognition as used within the scope of this work. Second, we investigate the intended environment of future industrial robotics to derive the desired criteria.

2.1 Definition of Object Recognition


Object recognition approaches differ in their exact definition of the recognition task.
Aiming for a perception module that enables robotic manipulation, we understand object
recognition as the combined task of (a) recognizing which objects actually are present in
the scene and (b) determining their pose with respect to the sensor. The desired output
of our object recognition system for a given scene is therefore a list of recognized objects
along with their pose in the scene.
Note that by using this definition of the recognition task, we implicitly consider only previously known objects. However, we do not consider this too harsh a restriction in the domain of industrial applications, as will be discussed in the subsequent section.

2.2 Object Recognition in Industrial Environments


Based on the definition given above, we now consider the characteristics and peculiarities
of industrial environments by discussing six relevant aspects: model data, sensor data,
object properties, scene properties, recognition performance, and computational effort.
As mentioned before, the model data must be known beforehand. Providing the necessary object descriptions should therefore require preferably low effort. In practice, CAD models are often available, while intensity images from various perspectives usually need to be captured manually.
There are several aspects regarding the type of sensor data. Since we consider robotic manipulation tasks, the sensor range must cover the robot's operation space. The object recognition system's robustness to noise determines the required quality of the sensor data. Also, the chosen modality has a great influence: while intensity images provide robust edge and texture features, range sensors directly reveal information about the objects' distance and geometry. Unfortunately, range sensors often fail when scanning transparent objects or reflective surfaces, which leads us to the object properties.
Industrial robots typically handle workpieces, product parts, or tools. Although these can generally have arbitrary shapes, they tend to be designed with rather simple geometries, dominated by edges, planes, and surfaces of constant curvature. While they often possess discrete or continuous rotational symmetries, they mostly do not have a distinctive texture. The size of the objects can vary to a great degree, since robots come in different sizes as well. However, one robot will likely handle only objects within a certain size range.
The scene properties represent another relevant criterion. An arbitrary number of instances of the same or different objects can be arbitrarily arranged in a scene, as for example in bin picking tasks. One can also expect some objects to be heavily occluded. However, we do not consider it necessary to recognize these objects with the same reliability, since due to the occlusion they are likely not graspable at all.
Naturally, important assessment criteria are the recognition performance and the re-
quired computational effort. The poses have to be retrieved accurately enough to ro-
bustly grasp the objects. At the same time, contemporary off-the-shelf hardware should
be sufficient to allow recognition times of preferably less than a few seconds.

3 Recent Research on Point Pair Feature Matching


Our recognition system uses the point pair feature matching of Drost et al. [2], which
outperformed other feature-based state-of-the-art approaches. This section outlines the
point pair features, gives a short overview of the feature matching method as in [2], and
summarizes relevant extensions that have been proposed in the meantime. For more
detailed explanations please consult the referenced publications.

3.1 Point Pair Features


The point pair features were first presented in [5] and encode primitive geometric rela-
tions between two oriented points. They can be formally described as

$$F(\mathbf{p}_r, \mathbf{p}_t) = \big(\|\mathbf{d}\|,\ \angle(\mathbf{n}_{p_r}, \mathbf{d}),\ \angle(\mathbf{n}_{p_t}, \mathbf{d}),\ \angle(\mathbf{n}_{p_r}, \mathbf{n}_{p_t})\big), \tag{1}$$

where $(\mathbf{p}_r, \mathbf{p}_t)$ is the point pair with the corresponding surface normals $\mathbf{n}_{p_r}$ and $\mathbf{n}_{p_t}$, $\|\mathbf{d}\| = \|\mathbf{p}_t - \mathbf{p}_r\|$ is the distance between the reference point $\mathbf{p}_r$ and the target point $\mathbf{p}_t$, and $\angle(\cdot, \cdot)$ denotes the angle between two vectors. Note that the resulting descriptor is generally pose invariant and asymmetric, i.e. $F(\mathbf{p}_r, \mathbf{p}_t) \neq F(\mathbf{p}_t, \mathbf{p}_r)$.
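As an illustration, a minimal C++ sketch of Eq. (1) could look as follows; the vector helpers and names are our own choices for illustration, not part of the implementation in [2].

```cpp
#include <algorithm>
#include <array>
#include <cmath>

using Vec3 = std::array<double, 3>;

static Vec3 sub(const Vec3& a, const Vec3& b) {
    return {a[0] - b[0], a[1] - b[1], a[2] - b[2]};
}
static double dot(const Vec3& a, const Vec3& b) {
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2];
}
static double norm(const Vec3& v) { return std::sqrt(dot(v, v)); }

// Angle between two vectors, clamped against rounding errors, in [0, pi].
static double angleBetween(const Vec3& a, const Vec3& b) {
    const double c = std::clamp(dot(a, b) / (norm(a) * norm(b)), -1.0, 1.0);
    return std::acos(c);
}

// The four components of F(p_r, p_t) as defined in Eq. (1).
struct PPF { double dist, angleN1D, angleN2D, angleN1N2; };

PPF pointPairFeature(const Vec3& pr, const Vec3& npr,
                     const Vec3& pt, const Vec3& npt) {
    const Vec3 d = sub(pt, pr);  // d = p_t - p_r
    return {norm(d), angleBetween(npr, d), angleBetween(npt, d),
            angleBetween(npr, npt)};
}
```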

3.2 Point Pair Feature Matching
The point pair feature matching can be split into two stages: model creation and matching. In the model creation stage, an object description is created for each object that shall later be recognized in the scene. This description is based on the point pair features of all ordered pairs of model points. The resulting descriptors are discretized using d_dist and d_angle as discretization steps. All features are then organized in a hash table, where the hash keys are built from the discretized feature vectors. The same feature descriptor can occur multiple times at different parts of the object; a hash cell therefore comprises multiple mapped values, each of which encodes the feature's pose with respect to the model. The resulting hash table is the desired model representation.
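Continuing the sketch above, the model description could be organized as follows; the key packing and value layout are our own assumptions, while the general scheme follows [2].

```cpp
#include <cstdint>
#include <unordered_map>
#include <vector>

// One occurrence of a feature on the model: the reference point it
// belongs to and the rotation angle alpha_m that encodes the pair's
// pose with respect to the model (see [2] for its derivation).
struct ModelEntry {
    int referencePointIndex;
    double alphaM;
};

using HashKey = std::uint64_t;
using ModelHashTable = std::unordered_map<HashKey, std::vector<ModelEntry>>;

// Quantize the four feature components with d_dist and d_angle and pack
// the bin indices into a single 64-bit key (assumes < 2^16 bins each).
HashKey makeKey(const PPF& f, double dDist, double dAngle) {
    const auto b0 = static_cast<std::uint64_t>(f.dist / dDist);
    const auto b1 = static_cast<std::uint64_t>(f.angleN1D / dAngle);
    const auto b2 = static_cast<std::uint64_t>(f.angleN2D / dAngle);
    const auto b3 = static_cast<std::uint64_t>(f.angleN1N2 / dAngle);
    return (b0 << 48) | (b1 << 32) | (b2 << 16) | b3;
}
```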
During the matching stage, every s_rp-th scene point is used as reference point, and the point pair features to all other scene target points are computed. By using the model's hash table representation, they can be matched to corresponding model features, which in turn suggest certain possible model poses. The pose suggestions are aggregated in a Hough-like voting scheme, one per scene reference point. The maxima from each voting space are extracted as weighted pose candidates, all of which are subsequently clustered. The cluster with the highest accumulated weight is then proposed as the final estimate of the object's pose in the scene. In case multiple object instances should be retrieved, the k highest-weight clusters are used.
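The voting step can be sketched as below, continuing the snippets above. The scene-side angle alpha_s and the conversion of accumulator maxima into poses are simplified here; their full derivation is given in [2].

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

constexpr double kTwoPi = 6.283185307179586;

// Cast votes for one (scene reference point, scene target point) pair:
// every model entry with a matching discretized feature votes for the
// cell (model reference point, alpha), where alpha = alpha_m - alpha_s
// aligns the model pair with the scene pair.
void castVotes(const ModelHashTable& table, const PPF& sceneFeature,
               double alphaS, double dDist, double dAngle, int nAngleBins,
               std::vector<std::vector<int>>& accumulator) {
    const auto it = table.find(makeKey(sceneFeature, dDist, dAngle));
    if (it == table.end()) return;
    for (const ModelEntry& e : it->second) {
        double alpha = std::fmod(e.alphaM - alphaS, kTwoPi);
        if (alpha < 0.0) alpha += kTwoPi;
        const int bin = std::min(
            static_cast<int>(alpha / kTwoPi * nAngleBins), nAngleBins - 1);
        ++accumulator[e.referencePointIndex][bin];
    }
}
// After all target points are processed, the accumulator maxima yield
// weighted pose candidates for this scene reference point.
```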
In [2], the point pair feature matching has been evaluated on the dataset provided by Mian et al. [4, 6], which comprises 50 scenes, each containing four or five toy objects. For objects with less than 84% occlusion, they achieve a recognition rate of 89.2% in under 2 seconds of computation time per object. The recognition rate can be further increased by using smaller discretization steps, but the computational effort then grows exponentially.

3.3 Impairments and Relevant Extensions


The original method yields competitive results, but there is still room for improvement.
Several extensions have been proposed to eliminate recurring error patterns. These
extensions mainly concern three aspects: (a) utilizing not only surface information but
also edge or boundary information, (b) dealing with ambiguities in the recognition of
rotationally symmetric objects, and (c) tweaking the algorithm to improve accuracy and
computation time.
The point pair feature matching as presented above optimizes the overlap of surfaces
of model and scene. The method therefore becomes prone to errors when objects appear
similar to background clutter. This is likely the case for industrial objects. Primitive
shapes with constant curvatures or planes can often also be found in the background.
For example, a metal plate could simply be matched to some panel or box. Drost and Ilic
[7] therefore combined depth and intensity data in a multi-modal point pair feature.
They can minimize this error pattern at the cost of requiring additional template images
from hundreds of different viewpoints to construct the model description. In contrast,
Choi et al. [8] extract additional boundary information directly from the point clouds
and construct point pair features using the orientation of both boundaries and surface
normals. Different feature types emerge, such as surface-to-surface, surface-to-boundary
and boundary-to-boundary features. All of these features can be treated the same during
the matching process since they share the same structure.
For rotationally symmetric objects, the point pair feature matching yields multiple
pose clusters for the same object instance, according to its number of rotational symme-
tries. At the same time, each of the clusters accumulates only a fraction of the available
votes. This renders the actual object poses less distinguishable from the pose candidates
produced by clutter and noise. However, most works either use only asymmetric objects
for evaluation [2, 9] or do not examine the influence of rotational symmetries [7, 8, 10].
Figueiredo et al. [11] present an approach to improve recognition for objects that are continuously rotationally symmetric about a vertical axis, such as a Coca-Cola can or a wine glass. However, their method can cope neither with discrete symmetries nor with multiple symmetry axes.
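None of the cited extensions covers discrete symmetries. One conceivable building block, shown below as our own sketch rather than a method from the literature, is a pose equivalence test that folds an n-fold symmetry about the object's z-axis into pose clustering or evaluation; Eigen is used here, as it is the linear algebra library underlying PCL.

```cpp
#include <Eigen/Geometry>
#include <cmath>

constexpr double kPi = 3.141592653589793;

// Two poses are treated as equivalent if they differ only by one of the
// object's n discrete symmetry rotations about its z-axis.
bool posesEquivalent(const Eigen::Isometry3d& a, const Eigen::Isometry3d& b,
                     int nFold, double transTol, double angleTol) {
    for (int k = 0; k < nFold; ++k) {
        const Eigen::Isometry3d sym =
            b * Eigen::AngleAxisd(2.0 * kPi * k / nFold,
                                  Eigen::Vector3d::UnitZ());
        const double dTrans = (a.translation() - sym.translation()).norm();
        const Eigen::AngleAxisd dRot(a.rotation().transpose() * sym.rotation());
        if (dTrans < transTol && std::abs(dRot.angle()) < angleTol)
            return true;
    }
    return false;
}
```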
While the original algorithm treats all model features equally, some of the features
might be more expressive than others. Tuzel et al. [10] use a machine learning approach
to learn weight factors. As training data they render scenes of the model with back-
ground clutter. While they report increased recognition rates, their approach is prone to
overfitting [10] and the derived weights are scene-dependent [9]. Birdal and Ilic [9] pursue
a scene-independent approach by exploiting the observability of the model points. They
reduce the likelihood of using potentially hidden points and show that it improves the
accuracy of pose estimation, although not necessarily the detection rate. Additionally,
they segment the scene prior to the recognition to reduce the computation time. This
is particularly helpful in case the scene is much larger than the object to be recognized.

4 Experiments
As shown in the preceding section, many efforts have been made to test and improve the original method. Yet, several aspects remain uncovered. No examination of the effects of (discrete) rotational symmetries has been conducted. Moreover, most works did not directly examine the influence of different types of objects, e.g. comparing the recognition performance for objects with more complex versus rather primitive shapes.
We implemented the method as described in [2] using PCL^1 and ROS^2 in order to investigate these open issues. This section first describes the datasets we used and subsequently presents the experiments and their results.

4.1 Datasets
Our evaluation is based on three datasets. As a reference for validating our implementation, we used the same dataset as in [2], which has been provided by Mian et al. [4, 6].

^1 Point Cloud Library, http://www.pointclouds.org/
^2 Robot Operating System, http://www.ros.org/

Table 1: Properties of the three different datasets.

                               MBO                    Household           Primitives
  objects                      4                      5                   5
  scenes                       50                     15                  20
  object instances per scene   1                      2                   2
  object geometry              complex                medium              simple
  rotational symmetries        no                     no                  yes
  capturing method             Minolta laser scanner  simulated Kinect 2  simulated Kinect 2

[Figure omitted: one sample scene per dataset with labeled objects.]

Figure 1: Scenes from (a) MBO [4], (b) Household, and (c) Primitives dataset.

We henceforth call it the MBO dataset. Furthermore, we compiled two synthetic datasets: the Household dataset uses objects from [12], and the Primitives dataset comprises geometrically simple but symmetrical objects, which resemble typical traits of industrial products and parts.
The characteristics of these three datasets are summarized in Table 1 and sample scenes are shown in Figure 1. Note that while the MBO dataset exhibits almost no noise, the Household and Primitives datasets contain simulated sensor noise with a standard deviation roughly equivalent to 0.7% and 1.5% of the model diameter, respectively.
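For reproducibility, a minimal sketch of how such noise can be applied follows; it assumes additive zero-mean Gaussian noise per coordinate, which is a simplification of a real depth sensor's noise characteristics (Vec3 as in the sketch in Section 3.1).

```cpp
#include <random>
#include <vector>

// Perturb every point with zero-mean Gaussian noise whose standard
// deviation is a fraction of the model diameter (0.7% for Household,
// 1.5% for Primitives).
void addGaussianNoise(std::vector<Vec3>& cloud, double modelDiameter,
                      double relSigma, unsigned seed = 42) {
    std::mt19937 rng(seed);
    std::normal_distribution<double> noise(0.0, relSigma * modelDiameter);
    for (Vec3& p : cloud)
        for (double& c : p) c += noise(rng);
}
```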

4.2 Results
We thoroughly tested our implementation on the MBO dataset. Figure 2 shows that
our results are very similar to the original results from [2]. Thus, we conclude that,
except for minor differences, our implementation is valid. The charts furthermore show

[Figure omitted: two charts of recognition rate over d_dist/diam(M) (left, for s_rp = 1, 2, 5, 10, 20, 40) and over degree of occlusion in % (right, for several s_rp/d_dist combinations).]

Figure 2: Influence of discretization distance d_dist, degree of occlusion, and subsampling of the scene (every s_rp-th point is used as reference point) on the recognition rates on the MBO dataset.

the effects of crucial configuration settings like the discretization distance d_dist and the scene subsampling step s_rp on the recognition rate. The best results are obtained with a fine discretization and no subsampling. Unfortunately, due to the time complexity of the algorithm, tightening either parameter increases the computation time, in the case of d_dist even exponentially. To balance recognition rate and matching time, we chose d_dist = 5% of the model diameter and s_rp = 2, which results in an overall recognition rate of 89.4% while matching takes on average 245 ms per scene and object. The right chart of Figure 2 furthermore indicates that the recognition rate drops drastically for objects with more than 82.5% occlusion, i.e. with less than 17.5% of the total object surface visible.
In a second experiment, we compared the recognition rates for all three datasets. Furthermore, since in the Household and Primitives datasets each object is present twice in each scene, we set up an experiment to retrieve both instances. We used the two highest-weight pose clusters and rated each scene according to the following rule: 100% if both object instances are retrieved correctly; 50% if the first cluster resembles a correct pose and the second one belongs to either the same instance (e.g. due to symmetries) or clutter; 0% if not even the first cluster resembles a correct object pose. Table 2 shows the results of these experiments. Figure 3 illustrates the recognition rates broken down per object, which allows us to investigate the effects of different object properties.
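The scoring rule can be stated compactly as a sketch; the two booleans are assumed to come from comparing the two highest-weight clusters against ground truth.

```cpp
// Score a scene under the two-instance evaluation rule: 1.0 if both
// instances are retrieved, 0.5 if only the first cluster is a correct
// pose (the second belongs to the same instance or to clutter), 0.0 if
// not even the first cluster is correct.
double sceneScore(bool firstClusterCorrect, bool secondClusterNewInstance) {
    if (!firstClusterCorrect) return 0.0;
    return secondClusterNewInstance ? 1.0 : 0.5;
}
```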

5 Discussion
With the insights gained in the previous sections, we can now discuss the compliance
of the object recognition system with respect to the assessment criteria defined in Sec-
tion 2.2.

Table 2: Recognition rates, average matching times per object and scene, and pose accuracies on the three different datasets. Pose accuracy is given as average translation error (relative to the object diameter) and rotation error of recognized objects.

                      MBO       Household   Primitives
  1 instance          89.4%     96%         78%
  2 instances         -         76%         44%
  matching time       300 ms    4600 ms     17100 ms
  translation error   0.46%     0.36%       1.00%
  rotation error      2.30°     1.03°       2.80°
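The accuracy figures in Table 2 can be computed from an estimated and a ground-truth pose as sketched below; the Eigen types are as in Section 3.3, and the exact error definitions are our reading of the table caption.

```cpp
#include <Eigen/Geometry>
#include <cmath>

struct PoseError {
    double relTranslation;  // translation error relative to object diameter
    double rotationDeg;     // absolute rotation error in degrees
};

PoseError poseError(const Eigen::Isometry3d& estimated,
                    const Eigen::Isometry3d& groundTruth,
                    double modelDiameter) {
    const double dTrans =
        (estimated.translation() - groundTruth.translation()).norm();
    const Eigen::AngleAxisd dRot(
        estimated.rotation().transpose() * groundTruth.rotation());
    return {dTrans / modelDiameter,
            std::abs(dRot.angle()) * 180.0 / 3.141592653589793};
}
```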

[Figure omitted: per-object recognition rate bar charts. Household objects: all, clorox, milk carton, spray bottle, water filter; Primitives objects: cuboid, cylinder, elliptic prism, hexagonal frustum, pyramid.]

Figure 3: Recognition rates by object when retrieving one vs. two instances in the (a) Household and (b) Primitives dataset. Note that "all" refers to an object name, not to all objects.

First, we investigate the influence of object properties. The results indicate that the recognition rate is generally lower for the primitive objects than for the household objects. This might be caused by several factors: first, the primitive objects exhibit rotational symmetries; second, since the primitives are generally smaller, the simulated noise has a bigger impact; and third, the point pair features are less discriminative for simple shapes and geometries, although the latter is contradicted by the fact that the household objects are recognized better than the MBO objects.
The recognition rate decreases when trying to retrieve both instances in each scene. Although this decrease is worse for the primitive objects, it is not necessarily linked to the number of rotational symmetries an object exhibits, since there are no big differences between the objects within each dataset. Also, it is not intuitively clear why certain objects are generally recognized better than others (e.g. elliptic prism vs. hexagonal frustum). We must therefore also take into account properties of the scene. For example, the mean distance between objects is much larger in the Household dataset than in the Primitives dataset, making recognition easier in the former. Also, the clutter features in the Primitives dataset are more similar to the object features.

Figure 4: Typical failure cases for the recognition of primitive objects.

This can be demonstrated by the typical failure cases shown in Figure 4: the algorithm maximizes surface overlap, and the planar surfaces of primitive objects are likely to also be found in the background.
We also demonstrated that the size of the objects with respect to the scene has a big impact on the computation time (see Table 2). The algorithm's complexity with regard to smaller discretization steps d_dist is exponential. Since d_dist is relative to the object's size, the matching time increases drastically for small objects contained in large scenes.
Table 2 also shows that we obtain very high accuracies for the estimated poses of recognized objects. These accuracies should be sufficient for the majority of robotic manipulation tasks. However, the method still has a crucial drawback with regard to our definition of the object recognition task. The algorithm simply gives a weighted list of pose clusters, without checking which of the pose clusters actually correspond to an object instance. The use of this method as an object recognition system is thus restricted to applications in which we can assume a certain number of objects to be present in the scene.

6 Conclusions
In this work we determined assessment criteria for an object recognition system in industrial robotic grasping applications. We examined our implementation of the method proposed by [2] with respect to these criteria and found that it is a very promising approach, yet with several drawbacks. Most crucially, the method is not decisive, i.e. it does not state whether an object is present or not. It merely suggests a number of object poses sorted by their assumed likelihood, which practically limits the range of possible applications. Moreover, objects as we expect them in industrial environments, with mainly planar surfaces and rotational symmetries, are not recognized as well as finely structured objects. Also, the computational effort increases drastically when looking for small objects in large scenes, which makes real-time applications intractable in those cases. We propose that future work focus on coping with rotational symmetries, speeding up recognition in large scenes, and determining how many objects are present in the scene.

Author Contributions and Acknowledgments
Xaver Kroischke and Martin Rudorfer jointly conceived the work and designed the ex-
periments. Xaver Kroischke researched the state of the art, implemented the algorithm,
and analyzed the experiment data in his Master's thesis. Martin Rudorfer supervised the thesis and wrote the paper. We thank Jörg Krüger and The Duy Nguyen for helpful discussions and enriching comments.

References
[1] Zengxi Pan, Joseph Polden, Nathan Larkin, Stephen van Duin, and John Norrish. Recent progress on programming methods for industrial robots. Robotics and Computer-Integrated Manufacturing, 28(2):87–94, 2012.

[2] Bertram Drost, Markus Ulrich, Nassir Navab, and Slobodan Ilic. Model globally, match locally: Efficient and robust 3d object recognition. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, pages 998–1005, 2010.

[3] Andrew E. Johnson and Martial Hebert. Using spin images for efficient object recognition in cluttered 3d scenes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 21(5):433–449, 1999.

[4] Ajmal S. Mian, Mohammed Bennamoun, and Robyn Owens. Three-dimensional model-based object recognition and segmentation in cluttered scenes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(10):1584–1601, 2006.

[5] Eric Wahl, Ulrich Hillenbrand, and Gerd Hirzinger. Surflet-pair-relation histograms: a statistical 3d-shape representation for rapid classification. In 3-D Digital Imaging and Modeling (3DIM), 2003 IEEE Fourth International Conference on, pages 474–481, 2003.

[6] Ajmal Mian, Mohammed Bennamoun, and Robyn Owens. On the repeatability and quality of keypoints for local feature-based 3d object retrieval from cluttered scenes. International Journal of Computer Vision, 89(2-3):348–361, 2010.

[7] Bertram Drost and Slobodan Ilic. 3d object detection and localization using multimodal point pair features. In 3D Imaging, Modeling, Processing, Visualization & Transmission (3DIMPVT), 2012 IEEE Second International Conference on, pages 9–16, 2012.

[8] Changhyun Choi, Yuichi Taguchi, Oncel Tuzel, Ming-Yu Liu, and Srikumar Ramalingam. Voting-based pose estimation for robotic assembly using a 3d sensor. In Robotics and Automation (ICRA), 2012 IEEE International Conference on, pages 1724–1731, 2012.

[9] Tolga Birdal and Slobodan Ilic. Point pair features based object detection and pose estimation revisited. In 3D Vision (3DV), 2015 IEEE International Conference on, pages 527–535, 2015.

[10] Oncel Tuzel, Ming-Yu Liu, Yuichi Taguchi, and Arvind Raghunathan. Learning to rank 3d features. In Computer Vision (ECCV), 2014 European Conference on, pages 520–535, 2014.

[11] Rui Pimentel Figueiredo, Plinio Moreno, and Alexandre Bernardino. Fast 3d object recognition of rotationally symmetric objects. In Pattern Recognition and Image Analysis, 2013 Iberian Conference on, pages 125–132, 2013.

[12] Aitor Aldoma, Federico Tombari, Luigi Di Stefano, and Markus Vincze. A global hypotheses verification method for 3d object recognition. In Computer Vision (ECCV), 2012 European Conference on, pages 511–524, 2012.
