Abstract. The increasing use of fast and efficient data mining algorithms in
huge collections of personal data, facilitated through the exponential growth of
technology, in particular in the field of electronic data storage media and
processing power, has raised serious ethical, philosophical and legal issues
related to privacy protection. To cope with these concerns, several privacy
preserving methodologies have been proposed, classified in two categories,
methodologies that aim at protecting the sensitive data and those that aim at
protecting the mining results. In our work, we focus on sensitive data protection
and compare existing techniques according to their anonymity degree achieved,
the information loss suffered and their performance characteristics. The
ℓ-diversity principle is combined with k-anonymity concepts, so that background
information cannot be exploited to successfully attack the privacy of the data
subjects the data refer to. Based on Kohonen Self Organizing Feature Maps
(SOMs), we first organize data sets in subspaces according to their
information theoretical distance to each other, then create the most relevant
classes paying special attention to rare sensitive attribute values, and finally
generalize attribute values to the minimum extent required, so that both the data
disclosure probability and the information loss are kept as low as possible.
Furthermore, we propose information theoretical measures for assessing the
anonymity degree achieved and empirical tests to demonstrate it.
Keywords: Privacy Enhancing Technologies, SOM, k-anonymity, ℓ-diversity
1 Introduction
Data contained in databases may be personal data, i.e. information that directly or
indirectly identifies an individual, for instance an address and a date of birth that can
be linked with publicly available datasets and background knowledge to reveal the
identity of an individual. Such a set of attributes is called a quasi-identifier (QI) set.
Data-mining a database can lead to the disclosure of personal data and the
identification of data subjects, i.e. the persons the data refer to. On the other hand,
exploiting such databases may offer many benefits to the community and support the
policy and action plan development process, for instance in the case of a pandemic. To
address these seemingly contradictory requirements, privacy-preserving data mining
techniques have been proposed [1, 2, 3, 4, 5, 6, 10].
Existing privacy-preserving data mining algorithms can be classified into two
categories: algorithms that protect the sensitive data itself in the mining process, and
those that protect the sensitive data mining results [1]. The most popular algorithms in
the data mining research community address k-anonymity and ℓ-diversity. They
belong to the first category and apply generalization and suppression methods to the
original datasets in order to preserve the anonymity of the individuals or entities the
data refer to.
K-anonymity requires each tuple in the published table to be indistinguishable from at
least k-1 other tuples [2]. Tuples with the same or close QI values form an
equivalence class. However, k-anonymity cannot protect against homogeneity and
background knowledge attacks [3]. To address these shortcomings, the ℓ-diversity
principle was proposed [3], which requires that different values of the sensitive
attributes are well represented in each equivalence class, thus preventing an attacker
from guessing the sensitive attribute value for a QI set with probability greater than
1/ℓ. Distinct ℓ-diversity requires that for each equivalence class ei, there are at least ℓ
distinct values in ei[S], where ei[S] is the multi-set of the sensitive attribute values of
ei [2, 3].
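The two definitions above can be checked mechanically. The following is a minimal sketch, assuming the table is a list of dictionaries; the attribute names (age, zip, disease) and the toy rows are hypothetical and are not taken from the Adult data set.

```python
from collections import defaultdict

def equivalence_classes(rows, qi):
    """Group rows (dicts) that share the same quasi-identifier values."""
    classes = defaultdict(list)
    for row in rows:
        classes[tuple(row[a] for a in qi)].append(row)
    return list(classes.values())

def is_k_anonymous(rows, qi, k):
    """Every equivalence class must contain at least k tuples."""
    return all(len(e) >= k for e in equivalence_classes(rows, qi))

def is_distinct_l_diverse(rows, qi, sensitive, l):
    """Every equivalence class must contain at least l distinct sensitive values."""
    return all(len({r[sensitive] for r in e}) >= l
               for e in equivalence_classes(rows, qi))

# Hypothetical, already generalized toy table.
table = [
    {"age": "30-39", "zip": "130**", "disease": "flu"},
    {"age": "30-39", "zip": "130**", "disease": "flu"},
    {"age": "40-49", "zip": "148**", "disease": "cancer"},
    {"age": "40-49", "zip": "148**", "disease": "flu"},
]

k_ok = is_k_anonymous(table, ["age", "zip"], 2)
l_ok = is_distinct_l_diverse(table, ["age", "zip"], "disease", 2)
```

In this toy table the first equivalence class is 2-anonymous but contains a single sensitive value, which is the homogeneity situation that k-anonymity alone does not prevent.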
In our work, we use the Adult data set provided by the UC Irvine Machine Learning
Repository [4], so that our research results can be compared with those presented in
the literature (see section 2), since this database has been widely used in
classification experiments. It consists of 30162 complete records, with 6 numerical
and 8 categorical attributes.
In [5] two greedy algorithms are proposed. The first is clustering-based and conducts
a bottom-up search, while the second is partition-based and works top-down. The
selection criterion for merging into an equivalence class is the weighted
certainty penalty (NCP), which takes both information loss and record
importance into account. In the bottom-up search, each tuple is initially
treated as an individual group. Each group with fewer than k tuples is merged
with another group such that the combined group has the smallest NCP. This is
repeated until every group has at least k tuples. At the end of the process, each
group with more than 2k tuples is split into smaller groups such that each has at
least k tuples. In the top-down approach, the two tuples that would cause the
highest NCP if merged into the same group are selected first and form the two
initial groups Gu, Gυ. The remaining tuples are then assigned to these groups one
at a time, in random order. The assignment of a tuple w depends on NCP(Gu,w) and
NCP(Gυ,w), where Gu, Gυ are the groups formed so far: w is assigned to the group
that leads to the lower NCP. The partitioning is applied recursively as long as a
group has k or more tuples. If a group G has fewer than k tuples, a group with more
than 2k-|G’| tuples is sought. From that group, a set G’ of (k-|G|) tuples is then
selected such that NCP(G∪G’) is minimized.
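The bottom-up merging loop can be sketched as follows. This is only an illustration under stated assumptions, not the implementation of [5]: the penalty used here is an assumed stand-in for NCP (the sum, over numeric attributes, of the group's value range divided by the attribute's domain range), attribute weights are omitted, and the final splitting of groups with more than 2k tuples is left out. The function names ncp and bottom_up are hypothetical.

```python
def ncp(group, ranges):
    """Assumed penalty: summed normalized value ranges of a group of numeric tuples."""
    total = 0.0
    for a, (lo, hi) in enumerate(ranges):
        if hi > lo:  # skip constant attributes to avoid division by zero
            vals = [t[a] for t in group]
            total += (max(vals) - min(vals)) / (hi - lo)
    return total

def bottom_up(tuples, k):
    """Merge undersized groups greedily until every group has at least k tuples."""
    if len(tuples) < k:
        raise ValueError("need at least k tuples")
    ranges = [(min(col), max(col)) for col in zip(*tuples)]
    groups = [[t] for t in tuples]  # each tuple starts as its own group
    while any(len(g) < k for g in groups):
        small = min((g for g in groups if len(g) < k), key=len)
        groups.remove(small)
        # merge with the partner that yields the smallest penalty
        best = min(groups, key=lambda g: ncp(small + g, ranges))
        groups.remove(best)
        groups.append(small + best)
    return groups

# Hypothetical (age, hours-per-week) tuples.
data = [(25, 40), (27, 38), (52, 20), (55, 22)]
result = bottom_up(data, k=2)
```

With these four hypothetical tuples and k = 2, the two younger records end up in one group and the two older records in the other, because those merges produce the smallest normalized ranges.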
In [6], the algorithm starts with a fully generalized dataset, one in which every tuple is
identical to every other, and systematically specializes it into one that is
minimally k-anonymous. The algorithm uses a tree search strategy to find an optimal
solution, i.e. a generalization with the least information loss and the strongest
privacy preservation. Because this technique can involve scanning and sorting the
entire dataset, and the solution space may be enormous, it uses pruning strategies to
reduce the solution space, together with a tree search algorithm with dynamic search
rearrangement named OPUS [7]. OPUS extends a systematic set-enumeration search
strategy [8] with dynamic tree rearrangement and cost-based pruning for solving
optimization problems. A node can be pruned only when the algorithm can determine
that neither the node itself nor any of its descendants could be an optimal solution.
For this determination, a lower bound on the cost of any node within the subtree
rooted at that node must be computed. If this lower bound exceeds the current best
cost, the node is pruned. To compute the lower-bound cost, the algorithm uses the
discernibility metric and the classification metric [6].
[9] proposes a genetic algorithm to find the optimal anonymization. Every possible
anonymization is encoded as a chromosome. Then, based on the Genitor algorithm
[11], it searches for the optimal solution, that is, the chromosome with the best
evaluation value. For the evaluation it uses the weighted certainty penalty
criterion [5]. In addition, the generalizations must be consistent with the
restrictions of the valid generalization notion mentioned in section 3.2.a.
step 1: The tuples of the dataset are bucketized according to their SA values into
buckets Bi.
step 3: One tuple is selected at random from the first bucket B1 and creates an
equivalence class e.
step 4: From each of the next ℓ-1 buckets Bi, the tuple that minimizes the
information loss according to the NCP metric is selected and added to e.
step 5: While there is a bucket with more than ℓ tuples, steps 1 to 4 are
repeated.
BSGI, which was inspired by “Anatomy” [12], implements ℓ-diversity by first
“bucketizing” the tuples according to their SA values and then greedily grouping them
into equivalence classes according to the similarity of their QI attributes. As
mentioned in section 4, it randomly selects a tuple from the largest bucket and tries to
find ℓ-1 other tuples from the next ℓ-1 largest buckets. If some “better” tuples
belong to the other |D|-ℓ buckets, this restriction is a limitation that may lead to
additional information loss.
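The bucketize-then-group idea can be sketched as below. This is a simplified illustration, not the BSGI implementation: the penalty is an assumed stand-in for NCP on numeric QI values, the loop condition used here (at least ℓ non-empty buckets remain) is an assumption of this sketch, and the treatment of leftover tuples is omitted. The names span_penalty and bsgi_sketch are hypothetical.

```python
import random
from collections import defaultdict

def span_penalty(group, qi_ranges):
    """Assumed stand-in for NCP: summed normalized ranges of the numeric QI values."""
    return sum(
        (max(t[0][a] for t in group) - min(t[0][a] for t in group)) / (hi - lo)
        for a, (lo, hi) in enumerate(qi_ranges) if hi > lo
    )

def bsgi_sketch(records, l, rng=None):
    """records: list of (qi_tuple, sensitive_value). Returns (groups, leftovers)."""
    rng = rng or random.Random()
    qi_ranges = [(min(c), max(c)) for c in zip(*(qi for qi, _ in records))]
    buckets = defaultdict(list)
    for rec in records:  # bucketize by sensitive value
        buckets[rec[1]].append(rec)
    groups = []
    # Assumed stopping rule: continue while l non-empty buckets remain.
    while sum(1 for b in buckets.values() if b) >= l:
        ordered = sorted((b for b in buckets.values() if b), key=len, reverse=True)
        seed = ordered[0].pop(rng.randrange(len(ordered[0])))  # random tuple, largest bucket
        group = [seed]
        for bucket in ordered[1:l]:  # next l-1 largest buckets
            best = min(bucket, key=lambda t: span_penalty(group + [t], qi_ranges))
            bucket.remove(best)
            group.append(best)
        groups.append(group)
    leftovers = [t for b in buckets.values() for t in b]
    return groups, leftovers

# Hypothetical records: (age,) as the only QI attribute, a disease as SA.
records = [((25,), "flu"), ((27,), "flu"), ((26,), "cancer"),
           ((50,), "cancer"), ((52,), "hiv")]
groups, leftovers = bsgi_sketch(records, l=2, rng=random.Random(0))
```

Because each group draws its tuples from ℓ different buckets, every group satisfies distinct ℓ-diversity by construction; the candidates, however, are restricted to the ℓ largest buckets, which is the limitation noted above.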
Finally, the total weighted certainty penalty NCP(T) mentioned in section 1 and the
discernibility metric CDM mentioned in section 2 are computed to evaluate
the algorithm.
4. Coding
Domain Hierarchy
The generalization process for the categorical attributes adopts the model
presented in [9]. It is based on the domain generalization hierarchy [10] and extends
it by imposing the restriction of valid generalization.
The domain ordering must be supplied by the user. This ordering should correspond
to the order in which the leaves are output by the preorder traversal of the hierarchy.
According to [9], “a generalization A is represented by a set of nodes SA in the
taxonomy tree and it is valid if it satisfies the property that the path from every leaf
node Y to the root encounters exactly one node P in SA. The value represented by the
leaf node Y is generalized in A to the value represented by the node P.”
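The validity condition quoted above translates directly into a path check. The sketch below assumes the taxonomy tree is given as a child-to-parent dictionary; the marital-status taxonomy shown is hypothetical and only serves as an illustration.

```python
def path_to_root(parent, leaf):
    """Return the nodes on the path from a leaf up to the root (inclusive)."""
    path, node = [], leaf
    while node is not None:
        path.append(node)
        node = parent.get(node)  # the root has no parent entry
    return path

def is_valid_generalization(parent, leaves, chosen):
    """Valid iff every leaf-to-root path meets exactly one chosen node."""
    return all(sum(n in chosen for n in path_to_root(parent, leaf)) == 1
               for leaf in leaves)

def generalize(parent, leaf, chosen):
    """Map a leaf value to the single chosen node P on its path to the root."""
    return next(n for n in path_to_root(parent, leaf) if n in chosen)

# Hypothetical taxonomy: Any -> {Married -> {Civ-spouse, AF-spouse}, Never-married}
parent = {"Married": "Any", "Never-married": "Any",
          "Civ-spouse": "Married", "AF-spouse": "Married"}
leaves = ["Civ-spouse", "AF-spouse", "Never-married"]

valid = is_valid_generalization(parent, leaves, {"Married", "Never-married"})
invalid = is_valid_generalization(parent, leaves, {"Married", "Civ-spouse"})
mapped = generalize(parent, "AF-spouse", {"Married", "Never-married"})
```

The second node set is rejected because the path from Civ-spouse meets two chosen nodes while the path from Never-married meets none.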
Each value domain is denoted by the least value belonging to its generalization
interval. Moreover, the values inside a value domain must be ordered.
This technique then imposes a total ordering over the set of all attribute domains such
that the values in the i-th attribute domain (Σi) all precede the values in any subsequent
domain (Σj, for j>i). The least value from each value domain is omitted. Thus, the
empty set {} represents the most general anonymization, which induces only a
single equivalence class of identical tuples.
Adding a new value to an existing anonymization specializes the data, while removing
a value generalizes it.
Chromosomes
Each chromosome is formed by concatenating the bit strings corresponding to each
potentially identifying column. If the attribute takes numeric values, the length of
the string that refers to this attribute is proportional to the granularity at which the
generalization intervals are defined. A string representing a numeric attribute is
formed according to the intervals of the generalization: the bit string for a numeric
attribute is made up of one bit for each potential end point, in value order. A value
of 1 for a bit implies that the corresponding value is used as an interval end point in
the generalization [9]. For example, if the potential generalization intervals for an attribute
are
Then the chromosome 100111 indicates that the values 0, 60, 80 and 100 are end points,
so the generalized intervals are [0,60], (60,80], (80,100].
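Decoding such a bit string is straightforward. In the sketch below the list of potential end points (0, 20, 40, 60, 80, 100) is an assumption chosen so that the chromosome 100111 selects the end points 0, 60, 80 and 100 named in the text; the function name decode_numeric is hypothetical.

```python
def decode_numeric(bits, endpoints):
    """Turn a bit string into a list of (low, high) interval bounds.

    bits[i] == '1' means endpoints[i] is used as an interval end point.
    """
    if len(bits) != len(endpoints):
        raise ValueError("one bit per potential end point is required")
    chosen = [p for b, p in zip(bits, endpoints) if b == "1"]
    # consecutive chosen end points delimit the generalized intervals
    return list(zip(chosen, chosen[1:]))

# Assumed potential end points, in value order.
potential = [0, 20, 40, 60, 80, 100]
intervals = decode_numeric("100111", potential)
```

Taking the stated end points at face value, the decoded bounds are (0, 60), (60, 80) and (80, 100); setting more bits to 1 specializes the attribute, while clearing bits generalizes it.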
For a categorical attribute with D distinct values that are generalized according to
the taxonomy tree T, the number of bits needed for this attribute is D-1. The leaf
nodes, which represent the distinct values, are arranged in the order resulting
from an in-order traversal of T. A value of 1 is assigned to the bit of the
chromosome that lies between two leaf nodes to indicate that those two leaf nodes
are separated in the generalization. Because some of the newly generated
chromosomes may not be valid, an additional step in the Genitor algorithm modifies
them into valid ones.
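The categorical encoding can be decoded in the same spirit: each of the D-1 bits sits between two adjacent leaves, and a 1 separates them. The sketch below uses a hypothetical leaf ordering and does not check validity against the taxonomy tree, which the text handles in a separate repair step.

```python
def decode_categorical(bits, leaves):
    """Split the ordered leaves into groups; a '1' bit separates adjacent leaves."""
    if len(bits) != len(leaves) - 1:
        raise ValueError("D leaves require D-1 bits")
    groups = [[leaves[0]]]
    for bit, leaf in zip(bits, leaves[1:]):
        if bit == "1":
            groups.append([leaf])    # boundary: start a new generalized value
        else:
            groups[-1].append(leaf)  # no boundary: same generalized value
    return groups

# Hypothetical leaf order of a marital-status taxonomy.
leaves = ["Civ-spouse", "AF-spouse", "Divorced", "Never-married"]
groups = decode_categorical("010", leaves)
```

An all-zero bit string places every leaf in a single group (the most general value), while an all-one string keeps every leaf value distinct.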
The discernibility metric assigns a penalty to each tuple based on how many tuples in the
transformed dataset are indistinguishable from it. This can be stated mathematically
as follows:
$$C_{DM}(g,k) = \sum_{\forall E \text{ s.t. } |E| \ge k} |E|^2 + \sum_{\forall E \text{ s.t. } |E| < k} |D|\,|E| \qquad (3.1)$$
where |D| is the size of the input dataset and E refers to the equivalence classes of
tuples in D induced by the anonymization g.
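As a worked example of the discernibility metric, the sketch below computes the penalty from the sizes of the induced equivalence classes. The class sizes used are hypothetical, and the function name discernibility is illustrative.

```python
def discernibility(class_sizes, k, dataset_size):
    """C_DM: |E|^2 for classes with |E| >= k, |D|*|E| for smaller (suppressed) classes."""
    return sum(size * size if size >= k else dataset_size * size
               for size in class_sizes)

# Hypothetical anonymization of a 6-tuple dataset into classes of size 3, 2 and 1.
sizes = [3, 2, 1]
cost = discernibility(sizes, k=2, dataset_size=6)
```

For k = 2 the classes of size 3 and 2 contribute 9 and 4, and the undersized class contributes 6 · 1, giving a total of 19.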
The classification metric assigns no penalty to an unsuppressed tuple if it belongs to
the majority class within its induced equivalence class, while all other tuples are
assigned a penalty of 1. More precisely:
$$C_{CM}(g,k) = \sum_{\forall E \text{ s.t. } |E| \ge k} |\mathrm{minority}(E)| + \sum_{\forall E \text{ s.t. } |E| < k} |E| \qquad (3.2)$$
where E is an equivalence class and the minority function takes an equivalence class
as its argument and returns all those records that are in a minority class with
respect to the class label. The first sum penalizes those records that have not been
suppressed but fall outside the majority class, while the second penalizes suppressed
tuples.
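The classification metric can be computed in the same way from the class labels of the tuples in each equivalence class. The sketch below assumes each equivalence class is given as a list of class labels; the labels are hypothetical.

```python
from collections import Counter

def classification_metric(classes, k):
    """C_CM: minority tuples in classes with |E| >= k, all tuples in smaller classes."""
    penalty = 0
    for labels in classes:
        if len(labels) >= k:
            # tuples outside the majority class are penalized by 1 each
            penalty += len(labels) - max(Counter(labels).values())
        else:
            # the whole class is suppressed, so every tuple is penalized
            penalty += len(labels)
    return penalty

# Hypothetical class labels per equivalence class.
classes = [["a", "a", "b"], ["a", "b"], ["b"]]
cost = classification_metric(classes, k=2)
```

Here the first class has one minority tuple, the second has one (a tie leaves one label as the majority), and the third is suppressed, so the total penalty is 3.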
6. Conclusions
| Algorithm | Privacy model | Numeric attributes | Categorical attributes | Strategy | Technique |
|---|---|---|---|---|---|
| Utility-Based Anonymization | k-anonymity | Age, education | work class, marital-status, occupation, race, gender, native-country | Greedy | bottom-up: Clustering; top-down: Partitioning |
| Data Privacy Through Optimal k-Anonymization | k-anonymity | Age, education | work class, marital-status, occupation, race, gender, native-country | Exhaustive search | Heuristic depth-first tree search |
| Transforming Data to Satisfy Privacy Constraints | k-anonymity | Age, education | work class, marital-status, occupation, race, gender, native-country | Genetic | |
| BSGI | ℓ-diversity | Age, final-weight, education, hours per week | marital-status, race, gender | Greedy | Clustering |