
PERSPECTIVE: Emerging Topics and Challenges for Statistical Analysis and Data Mining

Arnold Goodman Collaborative Data Solutions, 18231 Hillcrest Circle, Villa Park, CA 92861, USA

Received 16 November 2010; revised 4 January 2011; accepted 4 January 2011 DOI:10.1002/sam.10107 Published online 13 January 2011 in Wiley Online Library (wileyonlinelibrary.com).

Abstract: The Preview integrates a sample of emerging topics in statistical analysis and data mining, which I collected at four relevant 2010 meetings, and the major challenges which correspond to them. They are grouped into Mixtures of Problem Data, Methodologies for Solutions, Challenges for Statistical Analysis and Data Mining, and Challenges Suggested during Review. This Preview is meant to be a provocative overview; it is not meant to be a comprehensive or definitive research article. A vision of the near future in any area is there for us to perceive when we look for it carefully. It has the advantages of short-term predictions, and its disadvantages may be overcome by checking against current literature. In times of cascading change, such an informed Preview may be even more valuable than the far more typical Reviews of what has already been accomplished that are provided by most journals in overabundance. Now let us welcome the new decade full of things that have never been; imagine the possibilities! © 2011 Wiley Periodicals, Inc. Statistical Analysis and Data Mining 4: 3-8, 2011

Keywords: data mining; statistical analysis; complex data; massive data; complex problems; massive problems; emerging challenges; emerging topics; research to monitor; research to pursue; collaborations to pursue; collaborators to pursue; things to think about

1. INTRODUCTION

Among the most popular articles in almost all journals are Reviews of what has already been accomplished. Few if any journals devote sufficient content to an informed Preview of which topics are likely to be pursued in the near future. In times of cascading change, Previews may be even more valuable than the far more typical Reviews. They provide an early glimpse of opportunities to seek new understanding, plan future research, or find collaborators for any future research selected. A vision of the near future in any area is there for us to perceive when we look for it carefully. According to Peter Drucker, the future may be envisioned if we look through the correct window, and we can then predict the future by helping to create it. Helen Blau observed that "What we see depends upon where we look and how we look." Experience leads me to believe that what we see also depends upon when we look, where we look from, why we look, and how well we look. My Preview integrates a nonrandom survey of emerging topics which was collected from experts at the 2010
Correspondence to: A. Goodman (datagoodman@att.net). © 2011 Wiley Periodicals, Inc.

SIAM Data Mining Conference, 41st Symposium on the Interface of Computing Science and Statistics [1,2], Workshop on Climate Prediction of the UCLA Institute for Pure and Applied Mathematics, plus Joint Statistical Meetings. Sources for the emerging topics are given by the associated references. I invite those who suggested emerging topics to submit 1000-5000-word summaries of these topics, and some of the summaries may be appropriately grouped into future invited Perspectives. I also include several informed conjectures of my own, based upon my computing and data mining adventures since 1961.

This pioneering Preview is meant to be a provocative overview of emerging topics and challenges in statistical analysis and data mining. It is not meant to be a comprehensive and definitive research article and, as such, it is similar to the recent New York Times Feature on Voices: What's Next in Science [3]. I intend the Preview to be a beginning rather than an end in itself, so I invite others to build upon and reference it by submitting their own suggestions of emerging topics and challenges for other potential Perspectives. Since detailed explanations of accepted technical terminology would be prohibitive and detract from the flow and readability of the Preview,
please search the internet or consult available reference books for any terminology which happens to be unfamiliar. Emerging topics are grouped into Mixtures of Problem Data within Section 2 and Methodologies for Solutions within Section 3. Challenges for Statistical Analysis and Data Mining are posed by Section 4, while Challenges Suggested during Review are posed by Section 5.

2. MIXTURES OF PROBLEM DATA

Mixtures of categorical and numerical data [4] plus general heterogeneous data [5] have long been neglected. Mining for natural experiments in a massive database [6] is an unusual new data mixture. Its goal is to find the subset of a massive database that might correspond to a designed experiment, where the controlled variables have a clear response and any other potentially influential variables are approximately constant. A good technique is first to find simple relationships which appear to hold in a subpopulation, and then to find what other influential variables are needed to explain this response over an entire population. If influential variables happen to be approximately constant over a sample representing this subpopulation, then the sample qualifies to be a natural experiment. These results generalize under a wide variety of circumstances.

Mixtures of data series over space and time, which pose a more difficult opportunity, may be either multivariate [4,7] or nonstationary [8,9]. In addition, a new yet important opportunity is the current deluge of real-time security data streams [10]. Real-time security data streams must be analyzed while they are both available and productive. Although cloud computing is likely to be the technology of choice, it has its own inherent security risks. Some pertinent examples are unwanted and unwarranted process access, interference, interruption, and/or intervention. The ability to trace perpetrators may provide insufficient satisfaction.

Data distributed across a network [11] are also about to emerge. How do we optimize analysis over hardware, software, and information systems architectures within our constraints on cost and time? Which data are to be centralized, which analyses are to be distributed, and how are information transfers to be optimized? Increasingly complex data mixtures are: unstructured text and image data, multimedia image and video data, and large network and graph data [4].

The interplay between biology and genetics generates both complex and massive mixtures of data which often overwhelm us. Data where the number of parameters greatly outnumbers even the massive sample size [12] is an example. Others are data complexity coming from high dimensionality, data coming from multiple studies, and data interactions [13]. A third example is the reduction of dimension for massive datasets where the number of variables is itself massive [14]. A final type of unrelated and unusual data is rarely included in current data analysis. Symbolic data [15] involve lists and intervals, histograms and distributions, plus capacities and necessities. Symbolic data analysis is discussed in Section 3.

3. METHODOLOGIES FOR SOLUTIONS

Symbolic data analysis [15] is novel, yet much more novel in the United States than in Europe. Data visualization may be combined with sampling and robust methods within a single computing environment to drive an introductory statistics course [16]. Symbolic data analysis might well provide support to courses such as the one just mentioned by Welsch [8].

Seriation [17] is a combinatorial data analysis that reorders a series of objects along a one-dimensional continuum, so that they best reveal patterning and regularity among the whole series. It is exploratory in nature and has applications within archeology and anthropology, bioinformatics and biology, cartography, cellular manufacturing, ecology, graphics and visualization, operations research, psychology and psychometrics, plus sociology and sociometrics.

A fundamental problem is to construct an effective visualization of the partial differential equations employed in climate and weather modeling [18] (or elsewhere). This visualization should speak to those analyzing problem data, while preserving essential climate and weather scales as well as any other pertinent information over both space and time.

The exploding area of computational statistics and human behavior is dominated by new forms of massive data from individuals and social networks, mechanistic and statistical modeling, plus greatly increased computing power. Modeling efforts for such data [19] include model calibration to data, models for probabilistic and population projection, agent-based models, models for transportation and urban planning, models for risk and threat assessment, infectious disease models, plus models for game theory.

Modeling our uncertainty [19] is important in prediction, estimation, and decision making. Examples are uncertainty of model inputs, uncertainty of model structure, and uncertainty due to random variation.
Competing models may be compared using Bayes factors, while models may be validated with Bayesian predictive distributions. There is more on modeling uncertainty in Section 4.

Social and other networks are badly in need of predictive models. Statistical relational learning [20] is a growing data analysis area which applies data mining languages and algorithms to statistical problems. It has a number of existing applications, and it is an example of data miners using data mining methodology to solve a standard problem of statistical analysis in an innovative manner.

Targeted internet marketing, from awareness through consideration to purchase [4], may be a boon to marketing. Marketing has historically been about evenly split between the disparate cultures of brand advertising for consumer awareness and direct marketing for consumer purchase. Internet data richness has created a huge new arena of opportunities to reach consumers during consideration. Observing consumers in their transformation from initial awareness through consideration to final purchase may well show that marketing and advertising are not really unrelated. An opportunity for new methodology is to effectively model and then analyze consumer behavior during such a transformation. However, there are both political and technical movements in progress to allow consumers to opt out of being tracked through their browsers.

Analytics is the utilization of data mining and statistical analysis within an information system on problems with complex and massive databases. There are two emerging trends within analytics. Predictive analytics are hot and getting hotter [21], and this has prompted Master's Degree Programs in Analytics [21] to be established at Cornell, North Carolina State, Pennsylvania State, Purdue, and other universities, plus Rensselaer Polytechnic Institute. Finally, the Stanford Center for Professional Development offers online graduate courses in statistical analysis for data miners that can lead to certification.

Genetics research with complex and massive databases has exploded since the Human Genome Project. Data mining in genetics (D. Heckerman, personal communication) has not only emerged because of this, but is expanding and growing.
It is the data mining encore to statistical analysis in genetics, which began when statisticians Karl Pearson and R. A. Fisher presided over the birth of modern genetics in the early 1900s.

I complete this section with informed conjectures that many statistical approaches may support data mining. Why would experimental design not be useful with sampling from massive databases, as in mining for natural experiments in a massive database? Why would sequential analysis not be useful to determine when to stop such sampling? Why would meta-analysis not be useful, with its view from well above the data? Why would nonparametric analysis not be useful, given its minimal need for data assumptions? Why would mixed models not be useful, with their recursive estimations? Why would seriation and symbolic data analysis (in Section 3) not be helpful in their generality?

An interesting and potentially productive exercise may be to assess the applicability of these conjectures to each data mining application. There are many data mining approaches which might support statistical analysis, and statistical relational learning, discussed above, is clearly one of them.
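To make the first of these conjectures concrete, the search for natural experiments described in Section 2 can be reduced to a toy test on a candidate subpopulation: accept it only when the other influential variables are approximately constant while the controlled variable still shows a clear response. This is a minimal sketch with invented acceptance criteria and column names; it is not the actual method behind ref. [6].

```python
import statistics

def is_natural_experiment(rows, treatment, response, covariates,
                          cv_tol=0.05, min_effect=1.0):
    """Heuristic check: does this subgroup resemble a designed experiment?

    Hypothetical criteria: every covariate is approximately constant
    (coefficient of variation below cv_tol), while the mean response still
    differs clearly (by min_effect) between treated and untreated rows.
    """
    for cov in covariates:
        values = [r[cov] for r in rows]
        mean = statistics.mean(values)
        if mean == 0 or statistics.pstdev(values) / abs(mean) > cv_tol:
            return False  # covariate varies too much to be "held constant"
    treated = [r[response] for r in rows if r[treatment] == 1]
    control = [r[response] for r in rows if r[treatment] == 0]
    if not treated or not control:
        return False  # need both arms to observe a response
    return abs(statistics.mean(treated) - statistics.mean(control)) >= min_effect

# Toy subgroup: age is nearly constant, and treatment shifts the outcome.
subgroup = [
    {"age": 40, "treated": 1, "outcome": 5.0},
    {"age": 41, "treated": 1, "outcome": 5.2},
    {"age": 40, "treated": 0, "outcome": 2.9},
    {"age": 41, "treated": 0, "outcome": 3.1},
]
print(is_natural_experiment(subgroup, "treated", "outcome", ["age"]))  # True
```

A full miner would run such a test over many candidate subsets of a massive database and retain only the qualifying ones, which is where the sequential-analysis stopping question above becomes relevant.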

4. CHALLENGES FOR STATISTICAL ANALYSIS AND DATA MINING

I have been urging statisticians to broaden their horizons ever since the 1972 Fall Joint Computer Conference [22-24]. I attended KDD-97, KDD-98, and KDD 2001 as well as the SIAM Data Mining Conference (SDM) in 2004, 2005, and 2010. I organized Best of KDD or SDM Sessions for Interface 1998, 1999, 2001-2006, and 2010, and invited Tom Dietterich, Pedro Domingos, Charles Elkan, Usama Fayyad, and David Heckerman to them. Padhraic Smyth agreed to cochair Interface '01 on Data Mining and Bioinformatics with me.

I have also been urging data miners to widen their horizons ever since KDD-97, by inviting KDD 2001 to meet jointly with Interface '01 and then proposing an invited session on statistical analysis at KDD 2002 (without any success). However, Chandrika Kamath and Vipin Kumar invited me to cochair the SIAM Data Mining Conference for 2005 with them and then to cofound this collaborative Journal. It is worth noting that our discussion during a UCI luncheon, organized in conjunction with Interface '01, was instrumental in locating the University's first Statistics Department within its brand new School of Information and Computer Science. The School's current Dean, Hal Stern, was founding Chair of that Statistics Department and is, in fact, the country's very first Interface Dean.

For 15 years, I have presented factors which increase project quality and success, and they gradually evolved into the collaboration and value creation processes of success factors for problem solving, product development, and project management. Value creation was partially motivated by my desire to pave a path for data miners to successfully transition from their explorations in data mining to the promised discovery of knowledge. I now challenge both statisticians and data miners to implement collaboration and value creation processes across their complementary approaches to data analysis.
Data miners and statisticians should collaborate through collaboration's success factors of connecting across disciplines and attending each other's meetings, communicating across disciplines and coauthoring joint papers, contributing across disciplines and developing mutual analyses, committing to common analysis goals, changing behavior and submerging into an analysis community, in addition to challenging their
own cultures and environments to promote analysis success. This should encourage the implementation of value creation's success factors of defining actual stakeholder needs, identifying what resources should be, designing analysis tools, satisfying actual stakeholder needs, checking results with meaningful tests, and creating real stakeholder value from these results.

Since the achievement of common challenges is typically advanced by a difficult yet significant goal to be pursued together, I propose that statisticians and data miners jointly undertake characterization of the likely uncertainties within well-funded searches for relationships among diseases and genes. Although scientific dogma is productive in understanding Nature, the dogma of genetic determinism continues to exhibit signs of becoming obsolescent [25] and may be impeding this research as much as supporting it. Analyzing the massive databases of successful and unsuccessful genetics experiments might also help to bridge the huge gap between classical biology results and systems biology results. Inspired by Paul Silverman's final vision in 2004 [26], we introduced our first process model for the cellular protein cycle (from genes in DNA through various RNAs to heritable gene expression), and suggested that an uncertainty component be added to the biological dogma of genetic determinism [27] in 2005. Paul also believed that allowing for uncertainties in it would enrich, not detract from, genetics research.

The philosopher Democritus ("Everything existing in the universe is the fruit of chance and necessity."), the mathematician Gottfried Leibniz ("Nature has established patterns originating in the return of events, but only for the most part."), and the biologist Jacques Monod (Chance and Necessity) have essentially characterized Nature as what is now called a dynamical system with an uncertainty component.
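The picture of Nature as a dynamical system with an uncertainty component can be made tangible with a small Monte Carlo sketch: propagate uncertainty in the initial condition and in the dynamics through a toy linear recursion, then summarize the spread of outcomes. The model, parameters, and sample sizes here are all invented for illustration.

```python
import random
import statistics

def simulate(x0, a=0.9, noise_sd=0.1, steps=50, rng=None):
    """One path of a toy linear dynamical system: x[t+1] = a*x[t] + noise."""
    rng = rng or random.Random()
    x = x0
    for _ in range(steps):
        x = a * x + rng.gauss(0.0, noise_sd)
    return x

# Propagate uncertainty in the input x0 and in the dynamics (the noise)
# through the system: the spread of final states quantifies the
# "uncertainty component" that a single deterministic run would hide.
rng = random.Random(42)
finals = [simulate(rng.gauss(1.0, 0.2), rng=rng) for _ in range(2000)]
print(round(statistics.mean(finals), 2), round(statistics.pstdev(finals), 2))
```

The point of the sketch is the summary at the end: reporting the dispersion of the final states, not just a single trajectory, is exactly the "uncertainty due to random variation" named in Section 3.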
Quantum physics, the social sciences, and most other disciplines have long accepted and analyzed uncertainties within their data, in order to capitalize upon them instead of ignoring them. As we analyze data for Nature's scientific truths, we undoubtedly introduce our own variations that likely carry uncertainties, not only in the selection of assumptions and models to utilize them, but also in the selection of analyses to employ on those models. We place our stakeholders and ourselves in jeopardy as long as we continue to ignore such sources of uncertainty, which may cost us dearly. We also place our stakeholders and ourselves in harm's way when we analyze the data beyond what they will support and beyond what statistical analysis will ensure. In such instances, we should be extremely careful when the complexity and nonconformity of our problems force us to pursue analytic challenges and opportunities which lie beyond the protective insurance of statistical analysis.

I owe this cogent observation to Bradley Efron's recent presentation [28]. Are we not obligated to ensure that our analyses and results can be comfortably applied to data in related situations and similar environments [6-9]? Do we think that merely evaluating our results, just with the data at hand and only inside the analysis at hand, provides us with sufficient insurance to use them successfully in the related complex situations that are now routinely faced in the 21st Century? I have been advocating this challenge since 1999, have incorporated it into the "checking results" factor of my value creation process discussed above, and fully realize both its difficulties and its rewards.
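The Bayes factors mentioned in Section 3 offer one concrete route to the uncertainty characterization proposed in this section. As a minimal sketch (the models, priors, and data are invented), the marginal likelihood of each competing model is obtained by averaging the likelihood over that model's prior, and the ratio of marginal likelihoods is the Bayes factor:

```python
from math import comb

def marginal_likelihood(heads, n, prior_points):
    """Average the binomial likelihood over a discrete prior on the bias.

    prior_points: list of (theta, weight) pairs with weights summing to 1,
    a discrete stand-in for integrating the likelihood against a prior.
    """
    return sum(w * comb(n, heads) * th**heads * (1 - th)**(n - heads)
               for th, w in prior_points)

# Model 1: a fair coin (all prior mass at 0.5).
# Model 2: a biased coin (prior spread over higher biases). Both invented.
heads, n = 14, 20
m1 = marginal_likelihood(heads, n, [(0.5, 1.0)])
m2 = marginal_likelihood(heads, n, [(0.6, 0.5), (0.7, 0.5)])
bayes_factor = m2 / m1
print(bayes_factor > 1)  # 14 heads in 20 tosses favor the biased model
```

In practice the discrete priors would be continuous distributions and the averages would become integrals, often computed numerically, but the structure of the comparison is the same.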

5. CHALLENGES SUGGESTED DURING REVIEW

While this Preview was being reviewed, Innar Liiv made me first aware of data journalism [29] in particular and then of visual analytics [30] in general. My interpretative summary of Ref. [30] follows, in order to relate it to all previous emerging topics and challenges.

Visual analytics is the visual face of analytics, which enables and facilitates collaborative statistical analysis and data mining within an information system on problems with complex and massive databases. An extremely early ancestor (if not the first) of visual analytics is Chernoff's Faces [24] from four decades ago. Objectives are to facilitate understanding of massive and continuously growing collections of data mixtures that may be incomplete, spatial, temporal, or uncertain and have levels of abstraction. This should also involve task-adaptable representations to guide users and enable awareness of situations, while supporting knowledge discovery and actions through information synthesis based upon meaning rather than original form.

The first approach to accomplishing those objectives is to generate a taxonomy of interactions, from simple (e.g., correlation) to complex (e.g., causality), which scales across different types of displays and tasks. Secondly, we need to develop a theory and practice of transforming data into new scalable representations that faithfully represent the content of the data. The third approach is to create methods for synthesizing (or fusing) information, of differing types and from differing sources, into a unified data representation in order to focus upon the meaning of the data.
The next two approaches are to be capable of measuring data quality, reliability, and uncertainty throughout the data transformation and analysis ("checking results" in value creation), as well as producing methodology and tools to capture an assessment of the analysis, with its recommendations for decisions and actions, tailored to both receivers and situations plus supporting evidence as needed ("creating value" in
value creation). Approach six is to obtain methodologies and technologies which enable analysts to communicate what they know through appropriate visual metaphors, accepted reasoning principles, meaningful graphical representations, and mobile media forms, in addition to supporting both situation assessments by first responders and basic handbooks for public awareness and alerts ("communicating" in collaboration). Finally, develop a cohesive and integrated hardware and software environment for successful data collection and storage plus an analysis to enable and facilitate objectives, approaches, decisions, and actions.

An astute reviewer then suggested emerging topics and challenges to be included here as well. I believe that it is far more appropriate to introduce them now, and then cover them more completely in some future Preview, perhaps as coauthors. Those emerging topics and challenges address the important needs for: storage and computing hardware environments to handle massive data and methodology demands, statistical software upgrades tuned to these computing and storage hardware environments, evolution of statistical methodology beyond the classical framework of R. A. Fisher, Jerzy Neyman, and Egon Pearson, as well as enriched statistical education tailored to what we have learned and shall learn from emerging topics and challenges.

Parallel multicore-memory hierarchies [31] are coming to a hardware environment near you, with algorithms and supporting software following them. Prepare your thinking and your computer backups for such revolutionary changes, because they may be accompanied by surprising problems and an accompanying level of frustration. Dedicated hardware and software packages [5] may yield a remedy. I conjecture that embedded software (or what used to be called firmware) might accomplish the same purpose.
Our familiar software systems such as SAS, SPSS, S+, and R will need to adapt themselves to our emerging storage and computing hardware and software environments. Although exploratory data analysis and problem solving have been driving our analysis for quite some time, we now also need a new generation of mathematical statistics to develop methodology for rescuing us from the dilemma suggested by Bradley Efron and posed by me above. We need to recreate a productivity such as that exhibited by mathematical statistics from the 1920s through the 1970s. In the meantime, genuine collaboration combined with value creation (especially its success factors of "defining needs" and "checking results") may perhaps maintain and increase the support of our stakeholders.

Another issue posed by our emerging topics and challenges is how their product might contribute to the enrichment of statistical education. They could provide raw material for motivating and stimulating class discussions at all levels, annotated literature searches including

their conjectures at the Master's level, and likely even Ph.D. dissertations. All levels, beginning with graduate students, should be introduced to the successful practice of collaboration and value creation as preparation for their 21st Century careers. Have we been sufficiently preparing our students for their problems, experiences, and careers, or have we been preparing them merely for our own problems, experiences, and careers? Now let us welcome the new decade full of things that have never been; imagine the possibilities!

ACKNOWLEDGMENTS

I am extremely grateful to Jon Kettenring and David Heckerman for their generous guidance during my collection of emerging topics and for their contribution of personal emerging topics. In addition, Jon has helped immensely in the Preview's development and refinement, even up to the final stages of our editorial review process. My prize for suggesting by far the largest number of emerging topics on statistical analysis and data mining goes to the data miner Usama Fayyad. Recognizing the 10th Anniversary of UCLA's Institute for Pure and Applied Mathematics, I sincerely thank it for providing the workshops which allowed me to meet and later collaborate with Chandrika Kamath and Vipin Kumar, and to subsequently meet and collaborate with Cláudia Bellato. The former introductions produced this pioneering Journal, and the latter introduction produced an initial form of the first process model for the protein cycle from DNA through RNAs to heritable gene expression [27].

REFERENCES
[1] A. F. Goodman, Computers and statistics: evolution of the interface, Encyclopedia of Computer Science and Technology, Vol. 6, Marcel Dekker, New York and Basel, 1977.
[2] A. Goodman, PERSPECTIVE: brief history of the interface of computing and statistics which preceded data mining's birth, Stat Anal Data Mining 1(1) (2008), 54-56.
[3] http://www.nytimes.com/interactive/2010/11/09/science/20111109_next_feature.html?ref=science (2010) [Last accessed on November 15, 2010].
[4] U. Fayyad, 41st Symposium on the Interface of Computing Science and Statistics, Seattle, Washington, 2010.
[5] J. Han, 2010 SIAM Data Mining Conference, Columbus, Ohio, 2010.
[6] J. Verducci, 2010 SIAM Data Mining Conference, Columbus, Ohio, 2010.
[7] V. Kumar, 2010 SIAM Data Mining Conference, Columbus, Ohio, 2010.
[8] A. Goodman, 41st Symposium on the Interface of Computing Science and Statistics, Seattle, Washington, 2010.
[9] A. Goodman, 2010 SIAM Data Mining Conference, Columbus, Ohio, 2010.
[10] D. Levermore, Workshop on Climate Prediction of the UCLA Institute for Pure and Applied Mathematics, Los Angeles, California, 2010.
[11] A. Braverman, Workshop on Climate Prediction of the UCLA Institute for Pure and Applied Mathematics, Los Angeles, California, 2010.
[12] B. Yu, Workshop on Climate Prediction of the UCLA Institute for Pure and Applied Mathematics, Los Angeles, California, 2010.
[13] M. LeBlanc, 41st Symposium on the Interface of Computing Science and Statistics, Seattle, Washington, 2010.
[14] J. Kettenring, 41st Symposium on the Interface of Computing Science and Statistics, Seattle, Washington, 2010.
[15] L. Billard, 41st Symposium on the Interface of Computing Science and Statistics, Seattle, Washington, 2010.
[16] R. Welsch, 41st Symposium on the Interface of Computing Science and Statistics, Seattle, Washington, 2010.
[17] I. Liiv, 41st Symposium on the Interface of Computing Science and Statistics, Seattle, Washington, 2010.
[18] A. Majda, Workshop on Climate Prediction of the UCLA Institute for Pure and Applied Mathematics.
[19] A. Raftery, 41st Symposium on the Interface of Computing Science and Statistics, Seattle, Washington, 2010.
[20] P. Domingos, 41st Symposium on the Interface of Computing Science and Statistics, Seattle, Washington, 2010.
[21] J. Goodnight, Joint Statistical Meetings, Vancouver, Canada, 2010.
[22] A. Goodman, Measurement of computer systems: an introduction, In AFIPS Conference Proceedings, Vol. 41, Part II: 1972 Fall Joint Computer Conference, 1972, 3 years prior to the formation of the Computer Measurement Group, primary professional society.
[23] A. Goodman, Data modeling and analysis for users: a guide to the perplexed, In AFIPS Conference Proceedings, Vol. 41, Part II: 1972 Fall Joint Computer Conference, 1972, 3 years prior to the formation of the Computer Measurement Group, primary professional society.
[24] A. Goodman, Career Insurance for the 21st Century (in my invited session on Statistics and Statisticians in the 21st Century), In Computing Science and Statistics, Vol. 29(1): Mining and Modeling Massive Data Sets in Science, Engineering and Business with a Subtheme in Environmental Statistics, 1997.
[25] G. Babbitt, Chromatin evolving: despite our long familiarity with the chromosome, much about its function and evolution remains a mystery, American Scientist 99(1) (2011), 48-55.
[26] P. H. Silverman, Rethinking genetic determinism, Scientist 18(10) (2004), 32-33.
[27] A. F. Goodman, C. M. Bellato, and L. Khidr, The uncertain future for central dogma: uncertainty serves as a bridge from (genetic) determinism and reductionism to a new picture of biology, Scientist 19(12) (2005), 20-21.
[28] B. Efron, The future of indirect evidence, Presented in the Foundations of Statistics Seminar Series, Statistical Science, 2011.
[29] Journalism in the Age of Data, 2009. http://vimeo.com/14777910, http://datajournalism.stanford.edu/2009.5 [Last accessed on December 12, 2010].
[30] J. Thomas and K. Cook, eds., Illuminating the Path, National Visualization and Analytics Center, 2008. http://nvac.pnl.gov/agenda.stm [Last accessed on December 12, 2010].
[31] P. Gibbons, 2010 SIAM Data Mining Conference.
[32] S. Pantula, Joint Statistical Meetings.

