
COB Colloquium

4-29-14
Justin Jee
Speaker: Constantin Aliferis
Title: Frontier Problems in Feature Selection for Big Data Analytics
What is big data? Is it defined by the 3 Vs: Volume, Velocity, and
A(V)ailability? Well, perhaps. Aliferis presents a model in which big data
really means global or population-level data with millions of dimensions,
as opposed to small data, which is clinical or demographic data based on
selected biomarkers. Analyzing such data to develop a predictive model has
many facets, including filtering, correlation, and so on. Nevertheless,
such approaches are widely used in advertising, finance, and numerous
other industries.
The methods boil down to regularization, dimensionality reduction,
model selection parsimony criteria, feature construction, and, most
importantly, feature selection. The typical formulation of the feature
selection problem for supervised learning is to find the smallest set of
variables that yields maximum predictivity for the response T. A related
way of stating the problem is via irrelevancy: if variable A is
irrelevant for T, we can safely drop it from further consideration
without compromising our ability to predict T.
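As a concrete reading of that formulation, here is a toy sketch that scores every feature subset by cross-validation and then returns the smallest subset whose predictivity is within a small tolerance of the best. The synthetic dataset, classifier, and tolerance are illustrative assumptions, and exhaustive search is only feasible for a handful of variables, never for millions of dimensions:

    from itertools import combinations

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    # Small synthetic problem: 6 candidate features, binary response T.
    X, T = make_classification(n_samples=300, n_features=6, n_informative=2,
                               n_redundant=2, random_state=0)

    # Score every non-empty feature subset by cross-validated accuracy.
    scores = {}
    for size in range(1, X.shape[1] + 1):
        for subset in combinations(range(X.shape[1]), size):
            clf = LogisticRegression(max_iter=1000)
            scores[subset] = cross_val_score(clf, X[:, list(subset)], T, cv=5).mean()

    # Smallest subset whose predictivity is within tolerance of the maximum.
    best = max(scores.values())
    winner = min((s for s, v in scores.items() if v >= best - 0.01), key=len)
    print(f"best accuracy {best:.3f}; smallest near-optimal subset: {winner}")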
Aliferis defines his terms. Strong relevancy: if variable A is strongly
relevant for T, we should never drop it from consideration; otherwise we
compromise our ability to predict T (it carries unique information). Weak
relevancy: weakly relevant variables contain non-unique information, and
dropping them from consideration will not compromise the ability to
predict T. Wrappers try to solve feature selection by searching in the
space of feature subsets and evaluating each one with a user-specified
classifier and loss function estimator. Filter feature selection looks at
properties of the data (not a classifier). There is no filter or wrapper
that works universally; these must be tailored to the particular loss
function.
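A minimal scikit-learn sketch of the two styles (the estimator, scoring function, and number of features to keep are illustrative choices, not anything prescribed in the talk):

    from sklearn.datasets import make_classification
    from sklearn.feature_selection import (SelectKBest, SequentialFeatureSelector,
                                           f_classif)
    from sklearn.linear_model import LogisticRegression

    X, T = make_classification(n_samples=500, n_features=20, random_state=0)

    # Filter: ranks features by a property of the data alone (here, an
    # ANOVA F-statistic against T); no classifier is consulted.
    filter_sel = SelectKBest(score_func=f_classif, k=5).fit(X, T)

    # Wrapper: searches the space of feature subsets, scoring each candidate
    # with a user-specified classifier and loss (here, greedy forward search
    # with logistic regression and cross-validated accuracy).
    wrapper_sel = SequentialFeatureSelector(
        LogisticRegression(max_iter=1000), n_features_to_select=5,
        direction="forward", cv=5).fit(X, T)

    print("filter keeps:", filter_sel.get_support(indices=True))
    print("wrapper keeps:", wrapper_sel.get_support(indices=True))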
He notes that correlation is not synonymous with causation. He
connects the two via the Markov blanket. The Markov boundary of T, the
set of variables that renders all other variables irrelevant, is the
optimal solution to the feature selection problem. However, the Markov
boundary is rarely known a priori. Computational causal discovery is an
old field; Aliferis describes advances in path analysis, structural
equation modeling, and the work of Pearl, Verma, and others.
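Markov blanket induction algorithms such as IAMB, which Aliferis co-developed, discover the blanket through conditional-independence tests. Below is a toy IAMB-style sketch on synthetic discrete data; the plug-in conditional-mutual-information estimator, the fixed dependence threshold, and the generated network are all simplifying assumptions for illustration, not the talk's actual method:

    from collections import Counter

    import numpy as np

    def cond_mi(data, x, t, z):
        # Plug-in estimate of I(x; t | z) from empirical counts (discrete data).
        n, z = len(data), list(z)
        xtz = Counter(map(tuple, data[:, [x, t] + z]))
        xz = Counter(map(tuple, data[:, [x] + z]))
        tz = Counter(map(tuple, data[:, [t] + z]))
        zn = Counter(map(tuple, data[:, z])) if z else {(): n}
        mi = 0.0
        for (xv, tv, *zv), c in xtz.items():
            zv = tuple(zv)
            mi += (c / n) * np.log(c * zn[zv] / (xz[(xv, *zv)] * tz[(tv, *zv)]))
        return mi

    def iamb(data, target, threshold=0.02):
        mb = []
        candidates = [v for v in range(data.shape[1]) if v != target]
        # Forward phase: greedily admit the variable most dependent on the
        # target given the current blanket, until none passes the threshold.
        while True:
            open_vars = [v for v in candidates if v not in mb]
            if not open_vars:
                break
            scored = {v: cond_mi(data, v, target, mb) for v in open_vars}
            best = max(scored, key=scored.get)
            if scored[best] < threshold:
                break
            mb.append(best)
        # Backward phase: evict false positives that became conditionally
        # independent of the target once the rest of the blanket was admitted.
        for v in list(mb):
            rest = [w for w in mb if w != v]
            if cond_mi(data, v, target, rest) < threshold:
                mb.remove(v)
        return sorted(mb)

    # Synthetic network: A and B are parents of T, C is a child of T with
    # spouse S, and D is irrelevant. The Markov blanket of T is {A, B, C, S}.
    rng = np.random.default_rng(0)
    n = 5000
    A, B, S, D = (rng.integers(0, 2, n) for _ in range(4))
    flip = lambda p: (rng.random(n) < p).astype(int)   # random bit flips
    T = (A | B) ^ flip(0.1)       # noisy OR of its parents
    C = (T | S) ^ flip(0.1)       # child of T, co-parented by spouse S
    data = np.column_stack([A, B, S, C, D, T])
    print("recovered blanket of T:", iamb(data, target=5))  # expect [0, 1, 2, 3]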

Common techniques for feature selection include univariate filtering,
heuristic search, forward-backward stepwise procedures, and PCA.
Aliferis quickly walks through an empirical comparison of over 100
different algorithms. Apart from Markov blanket selection, no single
algorithm dominated across all cases.
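Worth noting is that PCA is the odd one out in that list of common techniques: it constructs new features rather than selecting original variables, so its components are not directly interpretable as, say, individual biomarkers. A minimal sketch (the synthetic data and component count are arbitrary choices):

    from sklearn.datasets import make_classification
    from sklearn.decomposition import PCA

    X, T = make_classification(n_samples=400, n_features=30, random_state=1)

    # PCA reduces dimensionality by building linear combinations of all 30
    # inputs; no original column is dropped or "selected".
    pca = PCA(n_components=10).fit(X)
    X_reduced = pca.transform(X)
    print("10 components capture",
          round(float(pca.explained_variance_ratio_.sum()), 2),
          "of the total variance")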
