[Image] A satellite image of snow on the Hindu Kush mountains in Asia, with regions of high absorption of sunlight by dust and black carbon shaded in red. Credit: Ann C. Bryant & Thomas H. Painter, JPL/Caltech Snow Optics Lab.

A vision for data science


To get the best out of big data, funding agencies should develop shared tools for
optimizing discovery and train a new breed of researchers, says Chris A. Mattmann.

Two small words – big data – are getting a lot of play across the sciences. Funding agencies, such as the National Science Foundation and the National Institutes of Health in the United States, have created million-dollar programmes around the challenges of storing and handling vast data streams. Although these are important, I believe that agencies should focus on developing shared tools for optimizing discovery.

Big data are big in three ways: the volume of information that systems must ingest, process and disseminate; the number and complexity of the types of information handled; and the rate at which information streams in or out. Terabyte-sized data sets (10¹² bytes) are now common in Earth and space sciences, physics and genomics (see 'Data deluge'). But a lack of investment in services such as algorithm integration and file-format translation is limiting the ability to manipulate archival data to reveal new science.

At the Jet Propulsion Laboratory (JPL) in Pasadena, California, I am a principal investigator in a big-data initiative, pursuing projects on data archiving and mining, smart algorithms and low-power hardware for astronomy and Earth science. Rather than finding one system that can do it all for any data set, my team aims to define a set of architectural patterns and collaboration models that can be adapted to a range of projects.

I believe that four advancements are necessary to achieve that aim. Methods for integrating diverse algorithms seamlessly into big-data architectures need to be found. Software development and archiving should be brought together under one roof. Data reading must become automated among formats. Ultimately, the interpretation of vast streams of scientific data will require a new breed of researcher equally familiar with science and advanced computing.

ALGORITHM INTEGRATION
A project by my team at the JPL illustrates the challenges of working with big data. In 2011, we were asked by the US National Climate Assessment to establish a computing facility to integrate a range of snow-related measurements – and to do so in a month.

The data included observations from the western United States, Alaska and the Hindu Kush–Himalayan regions, as well as the entire Earth-observing record since 2000 and subsequent monitoring. The data products and maps would amount to several hundred terabytes.

[Chart: DATA DELUGE] The billions of terabytes (TB) produced in one year by the SKA telescope will dwarf today's data sets in genomics and climate science.
- Encyclopedia of DNA Elements (ENCODE), 2012: 15 TB
- US National Climate Assessment (NASA projects), 2013: 1,000 TB
- Fifth assessment report by the Intergovernmental Panel on Climate Change (IPCC), due 2014: 2,500 TB
- Square Kilometre Array (SKA), first light due 2020: 22,000,000,000 TB per year

The algorithms to be incorporated were varied, and included codes for estimating snow coverage, grain size and absorption of solar radiation by dust and black carbon¹. They had been written in IDL, a specialized programming language used by many researchers. Geographers, remote-sensing experts and software programmers contributed.
Most computer scientists would assume that such a system would take years, not weeks, to develop. The algorithms would presumably have to be rewritten in a standard language such as C++, Java or Python, or one that could run on a fast computer system or infrastructure, such as Google's MapReduce model.

But, in my experience, there is no need to rewrite scientific algorithms for big-data systems. Rewriting only increases the barriers to communication between scientists and computer engineers. Rewriting can also introduce costly errors.
Computer engineers should trust scientists to produce executable algorithms, which can be plugged into a larger processing framework. The skill is in tying the input and output files and relevant parameters unobtrusively into the big-data network, so that the algorithm can run seamlessly within it. With a modular approach, development can proceed quickly in parallel – we constructed our snow-science computing facility this way in less than a month.
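
The pattern is simple enough to sketch. A minimal wrapper – here in Python, with hypothetical file names, paths and command-line flags rather than our actual configuration – stages the inputs, invokes the scientist's executable unchanged and hands the outputs to the rest of the pipeline:

    # A sketch of the modular approach: the science algorithm stays a
    # black-box executable (here, a hypothetical IDL routine), and the
    # wrapper only stages inputs, runs it and collects outputs.
    import subprocess
    from pathlib import Path

    def run_snow_algorithm(granule: Path, out_dir: Path) -> Path:
        """Run an existing science executable, unmodified, on one input file."""
        out_dir.mkdir(parents=True, exist_ok=True)
        product = out_dir / (granule.stem + "_snow.nc")
        # 'idl -e ... -args' is illustrative; any compiled or scripted
        # algorithm can be invoked the same way, with no rewriting.
        subprocess.run(
            ["idl", "-e", "estimate_snow_cover",
             "-args", str(granule), str(product)],
            check=True,  # surface failures to the workflow manager
        )
        return product

    for granule in sorted(Path("incoming").glob("*.hdf")):
        print("produced", run_snow_algorithm(granule, Path("products")))

Because each input file is processed independently, a workflow manager can fan out many such wrapped tasks at once, which is what lets development and processing proceed in parallel.
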
DEVELOPMENT AND STEWARDSHIP
Today, different big-data computing tasks are usually undertaken by different teams. The bulk of agency funding goes to building specific long-standing archives or data grids² – systems such as the NASA Earth science Distributed Active Archive Centers or the International Virtual Observatory Alliance in astronomy – that disseminate, preserve and steward³ data. Large archives have received an average of US$100 million a year from US federal agencies over the past decade.

By contrast, the development, integration and updating of science algorithms receive only between $1 million and $5 million per year in the United States. These tasks are carried out in science-computing facilities, which are often small and transient. Because they must do more for less, such facilities largely use and generate community-based open-source software⁴⁻⁶. Examples include Apache Hadoop⁷ and Apache Tika⁸, used in Earth science, biomedicine and business.
Although data interpretation and archiving efforts have so far been funded separately – and at strikingly dissimilar levels – their needs, such as workflow processing and file and resource management, are complementary and overlapping. As storage and computation costs fall, algorithm developers are moving into preservation, both to archive their own work and to open new research windows on large data sets that were previously closed.

In the next decade, I believe that archives and science-computing facilities must merge. The international radio-astronomy community is doing so in preparation for the Square Kilometre Array radio telescope, due to see first light in 2020. The enormous volume of data that the array will produce – 700 terabytes each second – will, after just a few days, eclipse the current size of the Internet. Archives in the United States, such as those at the National Radio Astronomy Observatory's Expanded Very Large Array and the Atacama Large Millimeter/submillimeter Array, are developing software to handle that deluge.
MANY FORMATS
Big-data systems must deal with thousands of file types and conventions. The communities that have formed around information modelling, ontology and semantic-web software address this complexity of data and metadata (descriptive terms attached to files) to some extent. But they have so far relied on human intervention. None has delivered the silver bullet: automatic solutions that identify file types and extract meaningful data from them.

Comparisons of observational and model data are, for example, under construction for the US National Climate Assessment and the Coupled Model Intercomparison Project of the Intergovernmental Panel on Climate Change. NASA uses the Hierarchical Data Format version 5 (HDF-5) and the HDF-Earth Observing System metadata representation. The outputs of climate models are stored in the Network Common Data Form, typically with climate and forecast metadata conventions⁹. Automatic methods will be needed to match and analyse these data, which amount to petabytes (10¹⁵ bytes).
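
Those conventions are machine-readable. As a minimal sketch – using the open-source netCDF4 Python library, with a hypothetical file and variable name – the self-describing climate and forecast (CF) metadata can be read directly:

    # Minimal sketch: inspect CF-style metadata in a NetCDF file.
    # 'tas_model_output.nc' and the variable name 'tas' are hypothetical.
    from netCDF4 import Dataset

    ds = Dataset("tas_model_output.nc")
    # A global attribute declares which metadata conventions the file follows.
    print("Conventions:", getattr(ds, "Conventions", "none declared"))
    tas = ds.variables["tas"]  # 'tas' is the CF name for near-surface air temperature
    # Per-variable attributes make the data self-describing, which is what
    # automatic matching across observations and models would exploit.
    print(getattr(tas, "standard_name", "?"), getattr(tas, "units", "?"), tas.shape)
    ds.close()
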
Some big-data fields are switching to formats like these that have better support. Astronomers, for instance, are turning to NASA's HDF-5 file format from the Flexible Image Transport System that has been their standard. But history shows that defining a single, unifying file format is not the answer, because proliferation of file types will continue. Instead, we need a toolkit of automatic ways to boil file formats down to their essence, and more formats that are amenable to those approaches. We need flexible systems that can perform multiple functions and deal with diverse data. Encouraging efforts are under way, including with Apache OODT¹⁰ and Apache Tika⁸.
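
As a rough illustration of that toolkit idea, Apache Tika⁸ can be driven from Python through the tika-python bindings (an assumption here: they run a local Tika server and require Java, and the file name is hypothetical) to identify a file's type from its content and extract metadata through one format-independent interface:

    # Minimal sketch: format-independent type detection and metadata
    # extraction with Apache Tika's Python bindings (tika-python).
    from tika import detector, parser

    path = "unknown_granule.dat"          # hypothetical input of unknown format
    mime_type = detector.from_file(path)  # type is detected from content, not extension
    parsed = parser.from_file(path)       # metadata and text via one interface
    print(mime_type)
    print(parsed["metadata"])
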
PEOPLE POWER
To solve big-data challenges, researchers need skills in both science and computing – a combination that is still all too rare. A new breed of data scientist is necessary. As well as being data stewards, data scientists will develop bespoke algorithms for analysis and adapt file formats. They will understand the mathematics, statistics and physics necessary to integrate science algorithms into efficient architectures. They will find solutions beyond the fragmented community efforts that have dominated the past decade of development of big-data systems.

Funding agencies should support computing facilities that combine big-data stewardship and software development, employing data scientists to bridge the gap. Coordination between agencies is crucial to avoid duplication. The Big Data Senior Steering Group, linking efforts across the National Science Foundation, the National Institutes of Health, NASA and others, is a promising early example. More oversight will be needed to establish new working patterns.

Because big-data fields stretch across national as well as disciplinary boundaries, such facilities and panels must be international. In centres of excellence around the world, such as the JPL, data scientists will help astronomers and Earth scientists to share their approaches with bioinformaticians, and vice versa.

For the specialism to emerge and grow, data scientists will have to overcome barriers that are common to multidisciplinary research. As well as acquiring understanding of a range of science subjects, they must gain academic recognition. Journals such as the Data Science Journal should become more prominent within the computing community. Software products and technologies should be valued more by academic committees.

New interdisciplinary courses will be needed. The University of California, Berkeley, and Stanford University in California have set up introductory courses for computer scientists on big-data techniques – more universities should follow suit. Natural scientists, too, should become familiar with computing and format issues.

In my lectures for computer-science graduates, I have brought together students at the University of Southern California in Los Angeles with researchers at the JPL. Using real projects, my students see the challenges awaiting them in their future careers. I hope to employ some of them on the projects that will flow from the JPL's big-data initiative. The technologies and approaches that they develop will spread beyond NASA through contributions to the open-source community. Empowering students with knowledge of big-data infrastructures and open-source systems now will allow them to make steps towards addressing the major challenges that big data pose.

Chris A. Mattmann is a senior computer scientist at the Jet Propulsion Laboratory, California Institute of Technology, Pasadena, California 91109, USA, and adjunct assistant professor in computer science at the University of Southern California, Los Angeles, California 90089, USA.
e-mail: chris.a.mattmann@nasa.gov

1. Painter, T. H., Bryant, A. C. & Skiles, S. M. Geophys. Res. Lett. 39, L17502 (2012).
2. Foster, I., Kesselman, C. & Tuecke, S. Int. J. High Perform. Comput. Appl. 15, 200–222 (2001).
3. Lynch, C. Nature 455, 28–29 (2008).
4. Morin, A. et al. Science 336, 159–160 (2012).
5. Spinellis, D. & Giannikas, V. J. Syst. Softw. 85, 666–682 (2012).
6. Ven, K., Verelst, J. & Mannaert, H. IEEE Software 25, 54–59 (2008).
7. White, T. Hadoop: The Definitive Guide 2nd edn (O'Reilly Media/Yahoo Press, 2010).
8. Mattmann, C. A. & Zitting, J. L. Tika in Action (Manning, 2011).
9. Cinquini, L. et al. Proc. 2012 IEEE 8th Int. Conf. E-Science, Chicago, Illinois, 8–12 October 2012 (in the press).
10. Mattmann, C. A., Crichton, D. J., Medvidovic, N. & Hughes, S. in Proc. 28th Int. Conf. Software Engineering (ICSE '06), Software Engineering Achievements Track 721–730 (2006).

The rebound effect is overplayed

Increasing energy efficiency brings emissions savings. Claims that it backfires are a distraction, say Kenneth Gillingham and colleagues.

[Image] Fuel-efficient cars cost less to run, so people might use them a little more. Credit: Zhang Jun/Xinhua Press/Corbis.

Buy a more fuel-efficient car and you will spend more time behind the wheel. That argument, termed the rebound effect, has earned critics of energy-efficiency programmes a voice in the climate-policy debate, for example with an article in The New York Times entitled 'When energy efficiency sullies the environment'¹. The rebound-effect idea – and its extreme variant, the 'backfire' effect, in which supposed energy savings turn into greater energy use – stems from nineteenth-century economist Stanley Jevons. In his 1865 book The Coal Question, Jevons hypothesized that energy use rises as industry becomes more efficient because people produce and consume more goods as a result².

The rebound effect is real and should be considered in strategic energy planning. But it has become a distraction. A vast