
practice

doi: 10.1145/1364782.1364797


Communications of the ACM | July 2008 | Vol. 51 | No. 7

Beyond Relational Databases

There is more to data access than SQL.

By Margo Seltzer

Illustration by Celia Johnson

The number and variety of computing devices in the environment are increasing rapidly. Real computers are no longer tethered to desktops or locked in server rooms. PDAs, highly mobile tablet and laptop devices, palmtop computers, and mobile telephony handsets now offer powerful platforms for the delivery of new applications and services. These devices are, however, only the tip of the iceberg. Hidden from sight are the many computing and network elements required to support the infrastructure that makes ubiquitous computing possible.

With so much computing power traveling around in briefcases and pockets, developers are building applications that would have been impossible just a few years ago. Among the interesting services available today are text and multimedia messaging, location-based search and information services (for example, on-demand reviews of nearby restaurants), and ad hoc multiplayer games. Over the next several years, new classes of mobile and personalized services, impossible to predict today, will certainly be developed.

While these services differ from one another in major ways, they also share some important attributes. One, the focus of this article, is the need for data storage and retrieval functions built into the application. Messaging applications need to move messages around the network reliably and without loss. Location-based services need to map physical location to logical location (for example, GPS or cell-tower coordinates to postal code) and then look up location-based information. Gaming applications must record and share the current state of the game on distributed devices and must manage content retrieval and delivery to each of the devices in real time. In all these cases, fast, reliable data storage and retrieval are critical.

As soon as the discussion turns to data storage and retrieval, relational databases come to mind. Relational databases have been tremendously successful over the past three decades, and SQL has become the lingua franca for data access. While data management has become almost synonymous with RDBMS, however, there are an increasing number of applications for which lighter-weight alternatives are more appropriate.

This article begins with a brief review of how relational systems came to dominate the data management landscape, and discusses how the relational technologies have evolved. It presents a data-centric overview of today's emergent applications, and delves into data management needs for today's and tomorrow's applications.

Relational Prehistory

Relational databases came out of research at IBM [1, 5] and the University of California at Berkeley [7] in the 1970s. Relational databases were fundamentally a reaction to escalating costs in deploying and maintaining complex systems. The key observation was that programmers, who were very expensive, had to rewrite large amounts of application software manually whenever the content or physical organization of a database changed. Because the application generally knew in detail how its data was stored, including its on-disk layout, reorganizing databases or adding new information to existing databases forced wholesale changes to the code accessing those databases.

Relational databases solved this problem in two ways. First, they hid the physical organization of the database from the application and provided only a logical view of the data. Second, they used a declarative language to describe the data of interest in a particular query, rather than forcing the programmer to write a collection of function calls to fetch the data. These two changes allowed programmers to describe the information they wanted and to leave the details of optimization and access to the database management system. This transformation relieved programmers of the burden of rewriting application code whenever the database layout or organization changed.

Relational databases enjoyed tremendous success in the IT shops and data centers of the world. Businesses with large quantities of data to manage and sophisticated applications using that data adopted the new technology quickly. Demand for relational products created a market worth billions of dollars in licensing revenue per year. Several RDBMS vendors arose in the 1980s to compete for this lucrative business.

In the 20 years that followed, two related trends emerged. First, the RDBMS vendors increased functionality to provide market differentiators and to address each new market niche as it arose. Second, few applications need all the features available in today's RDBMSs, so as the feature set size increased, each application used a decreasing fraction of that feature set.

This drive toward increasing DBMS functionality has been accompanied by increasing complexity, and most deployments now require a specialist, trained in database administration, to keep the systems and applications running. Since these systems are developed and sold as monolithic entities, even though applications may require only a small subset of the system's functionality, each installation pays the price of the total overall complexity.

Surely, there must be a better way.
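The contrast drawn in this section, between hand-written access code that knows how the data is laid out and a declarative query against a logical view, can be sketched in a few lines. This is only an illustration under stated assumptions: the `employees` table, its columns, the sample rows, and the helper names are invented here, and Python's built-in `sqlite3` module stands in for a relational system.

```python
import sqlite3

# Hypothetical sample data; table and column names are illustrative only.
ROWS = [
    ("alice", "engineering", 120),
    ("bob", "sales", 90),
    ("carol", "engineering", 110),
]


def procedural_lookup(records, dept):
    """Pre-relational style: the caller knows each record is a
    (name, dept, salary) tuple and walks the storage itself."""
    return sorted(name for (name, d, _salary) in records if d == dept)


def build_db(rows):
    """Load the same rows into an in-memory relational table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE employees (name TEXT, dept TEXT, salary INTEGER)")
    conn.executemany("INSERT INTO employees VALUES (?, ?, ?)", rows)
    return conn


def declarative_lookup(conn, dept):
    """Relational style: state what is wanted and let the engine
    choose how to reach it."""
    cur = conn.execute(
        "SELECT name FROM employees WHERE dept = ? ORDER BY name", (dept,)
    )
    return [row[0] for row in cur]


if __name__ == "__main__":
    conn = build_db(ROWS)
    print(procedural_lookup(ROWS, "engineering"))
    print(declarative_lookup(conn, "engineering"))
    # Adding an index changes the physical organization,
    # yet the query text above does not need to change.
    conn.execute("CREATE INDEX idx_dept ON employees (dept)")
    print(declarative_lookup(conn, "engineering"))
```

If the tuple layout in `ROWS` changed, `procedural_lookup` would have to be rewritten, whereas `declarative_lookup` survives the added index untouched. That is the property the relational pioneers were after.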


The New Frontier

We are not the first to notice these tides of change. In 1998, the leading database researchers concluded that database management systems were becoming too complex and that automated configuration and management were becoming essential [2]. Two years later, Surajit Chaudhuri and Gerhard Weikum proposed radically rethinking database management system architecture [4]. They suggested that database management systems be made more modular and that we broaden our thoughts about data management to include rather simple, component-based building blocks. Most recently, Michael Stonebraker joined the chorus, arguing that one size no longer fits all, and citing particular application examples where the conventional RDBMS architecture is inappropriate [8].

As argued by Stonebraker, the relational vendors have been providing the illusion that an RDBMS is the answer to any data management need. For example, as data warehousing and decision support emerged as important application domains, the vendors adapted products to address the specialized needs that arise in these new domains. They do this by hiding fairly different data management implementations behind the familiar SQL front end. This model breaks down, however, as one begins to examine emerging data needs in more depth.

Data warehousing. Retail organizations now have the ability to record every customer transaction, producing an enormous data source that can be mined for information about customers' purchasing patterns, trends in product popularity, geographical preferences, and countless other phenomena that can be exploited to increase sales or decrease the cost of doing business. This database is read-mostly: it is updated in bulk by periodically adding new transactions to the collection, but it is read frequently as analysts cull the data, extracting useful tidbits. This application domain is characterized by enormous tables (tens or hundreds of terabytes), queries that access only a few of the many columns in a table, and a need to scan tables sorted in a number of different ways.

Directory services. As organizations become increasingly dependent upon distributed resources and personnel, the demand for directory services has exploded [3]. Directory servers provide fast lookup of entities arranged in a hierarchical structure that frequently matches the hierarchical structure of an organization. The LDAP standard emerged in the 1990s in response to the heavyweight ISO X.400/X.500 directory services. LDAP is now at the core of authentication and identity management systems from a number of vendors (for example, IBM Tivoli's Directory Server, Microsoft's Active Directory Server, the Sun ONE Directory Server). Like data warehousing, LDAP is characterized by read-mostly access. Queries are either single-row retrieval (find the record that corresponds to this user) or lookups based on attribute values (find all users in the engineering department). The prevalence of multivalued attributes makes a relational representation quite inefficient.

Web search. Internet search engines lie at the intersection of database management and information retrieval. The objects upon which they operate are typically semistructured (that is, HTML instead of raw text), but the queries posed are most often keyword lookups where the desired response is a sorted list of possible answers. Practically all the successful search engines today have developed their own data management solution to this problem, constructing efficient inverted indices and highly parallelized implementations of index and lookup. This application is read-mostly with bulk updates and nontraditional indexing.

Mobile device caching. The prevalence of small, mobile devices introduces yet another category of application: caching relevant portions of a larger dataset on a smaller, low-functionality device. While today's users think of their cell phone's directory as their own data collection, another view might be to think of it as a cache of a global phone and address directory. This model has attractive properties: in particular, the ability to augment the local dataset with entries as they are used or needed. Mobile telephony infrastructure requires similar caching capabilities to maintain communication channels to the devices. The access pattern observed in these caches is also read-mostly, and the data itself is completely transitory; it can be lost and regenerated if necessary.

XML management. Online transactions are increasingly being conducted by exchanging XML-encoded documents. The standard solution today involves converting these documents into a canonical relational organization, storing them in an RDBMS, and then converting again when one wishes to use them. As more documents are created, transmitted, and operated upon in XML, these translations become unnecessary, inefficient, and tedious. Surely there must be a better way. Native XML data stores with XQuery and XPath access patterns represent the next wave of storage evolution. While new items are constantly added to and removed from an XML repository, the documents themselves are largely read-only.
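To make the XML management point concrete, the sketch below queries an XML-encoded order directly, with no conversion to a relational layout and back. It is only an illustration under stated assumptions: the order document, its element and attribute names, and the helper functions are invented here, and Python's standard `xml.etree.ElementTree` module, which supports a limited subset of XPath and no XQuery, stands in for a native XML store.

```python
import xml.etree.ElementTree as ET

# Hypothetical order document; element and attribute names are illustrative only.
ORDER_XML = """
<order id="1001">
  <customer>Acme Corp</customer>
  <item sku="A-1" qty="2"><price>9.50</price></item>
  <item sku="B-7" qty="1"><price>20.00</price></item>
  <item sku="C-3" qty="4"><price>1.25</price></item>
</order>
"""


def skus(root):
    """Return the SKU of every item, using a simple path expression."""
    return [item.get("sku") for item in root.findall("./item")]


def order_total(root):
    """Sum quantity times price across all items in the order."""
    return sum(
        int(item.get("qty")) * float(item.findtext("price"))
        for item in root.findall("./item")
    )


def item_by_sku(root, sku):
    """Look up one item with an attribute predicate, part of the
    XPath subset that ElementTree understands. Returns None if absent."""
    return root.find(f"./item[@sku='{sku}']")


if __name__ == "__main__":
    root = ET.fromstring(ORDER_XML.strip())
    print(skus(root))
    print(order_total(root))
    print(item_by_sku(root, "B-7").get("qty"))
```

The document is read in its native form and never shredded into tables, which matches the largely read-only access pattern described above.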

Stream processing. Stream processing is a bit of an outcast in this laundry list of data-intensive applications. Strictly speaking, stream processing is not a data management task; it is a data-filtering task. That is, data is produced at some source and sent streaming to recipients that filter the stream for interesting events. For example, financial institutions watch stock tickers looking for hotly traded items and/or stocks that aren't being traded as heavily as expected.

The reason that these stream-processing applications are included here is a linguistic one: the filters that are typically desired in these environments look like SQL; however, while SQL was designed to operate on persistently stored tables, these queries act upon a real-time stream of data values. Stonebraker explains in some depth how poorly equipped databases are for this task. Perhaps the bigger surprise is not that database systems are poorly equipped to address this task, but that because SQL appears to be the right query language, developers use relational database systems for applications that have no persistent storage!

Stream processing represents a class of applications that could benefit from a SQL-like query language atop a data management system with properties that are radically different from an RDBMS. Since streaming queries frequently operate on data observed during a time window, some transient local storage is necessary, but this storage needn't be persistent, transactional, or support complex query processing. Instead, it must be blindingly fast. Although relational databases are well-equipped to handle dynamic queries over relatively static or slowly changing data, this application class is characterized by a fairly static query set over highly dynamic data.

Flexible Solutions

Relational systems have been designed to satisfy online transaction processing (OLTP) workloads characterized by ad hoc queries, significant write traffic, and the need for strong transactional and integrity guarantees. In contrast, the applications described here are almost all read-dominated, and streaming applications don't even take advantage of persistent data, just an SQL-like query language. Few of these applications require transactional guarantees, and there is little inherently relational about the data being accessed. Thus, the data management question becomes how best to satisfy the needs of these different types of applications. We claim (like Stonebraker) that there really is no single right answer. Instead, we must focus on flexible solutions that can be tailored to the needs of a particular application.

There are several ways to deliver flexibility in today's changing data environment. The back-to-basics approach is to require that every single application build its own data storage service. This option, while seemingly simple, is impractical in all but the simplest of applications. Some data-intensive applications running today, however, are built upon simple, homegrown solutions.

The second way to address the need for flexibility is to provide a smorgasbord of data management options, each of which addresses a particular application class. We see this approach emerging in the traditional relational market, where the SQL veneer is used to hide the different capabilities required for OLTP and data warehousing.

The third approach to flexibility is to produce a storage engine that is more configurable so that it can be tuned to the requirements of individual applications. This solution has the advantage of allowing concentrated investment in a single storage system, improving quality. Configurability, however, makes new demands of developers who use the database, since they must understand the configuration options and then integrate the data management component properly into their product designs.

In fact, the solution emerging in the marketplace is to have a handful of reasonably configurable storage systems, each of which is useful across a broad application class.

There are fundamentally two properties that a solution must possess to address the wide range of application needs emerging today: modularity and configurability. Few applications require all the functionality possible in a data management system. If an application doesn't need functionality, it should not have to pay for that functionality in size (footprint, memory consumption, disk utilization, and so on), complexity, or cost. Therefore, a flexible engine must allow
the developer to use or exclude major subsystems depending on whether the application needs them. Once a system is sufficiently modular to permit a truly small footprint, we will find that system deployed on an array of hardware platforms with staggeringly large differences in capabilities. In these cases, the system must be configurable to its operating environment: the specific hardware, operating system, and application using it.

Modularity

Some argue that database architecture is in need of a revolution akin to the RISC revolution in computer hardware. The conventional monolithic DBMS architecture is not facile enough to adapt to today's data demands, so we must build data management capabilities out of a collection of small, simple, reusable components. For example, instead of viewing SQL as a simple binary decision, Chaudhuri and Weikum argue that query capabilities should be provided at different levels of sophistication: a single-table selection processor that has a B+ tree index that supports simple indexing, updating, and selection. To this, you might add transactions. Continuing up the complexity hierarchy, consider a select-project-join processor. Next, add aggregates. In this manner, you transform SQL from a monolithic language into a family of successively richer languages, each of which is provided as a component and satisfies a significant number of application domains. Any particular application selects the components it needs. This idea of a component-based architecture can be extended to include several other aspects of database design: concurrency control, transactions, logging, and high availability.

Concurrency control lends itself to a hierarchy similar to that presented in the language example. Some applications are completely single-threaded and require no locking; others have low levels of concurrency and would be well served by table-level locks or API-level locks (allowing only one writer or multiple readers into the database system simultaneously); finally, highly concurrent applications need fine-grain locking and multiple degrees of isolation (potentially allowing applications to see values that have been written by incomplete transactions) [6]. In a conventional database management system, locking is assumed; in the brave new world discussed here, locking is optional and different components can be used to provide different levels of concurrency.

Transactions provide the illusion that a collection of operations are applied to a database in an atomic unit and that once applied, the operations will persist, even in the face of application or system failure. Transaction management is at the heart of most database management systems, yet many applications do not require transactions. In a component-based world, transactions, too, are optional. When they are present, a system might still have a number of different components providing basic transactional mechanisms, savepoints (the ability to identify a point in time to which the database may be rolled back), two-phase commit to support transactions that span multiple databases, nested transactions to decompose a large operation into a number of smaller ones, and compensating transactions to undo high-level, logical operations.

Many transaction systems use some form of logging to provide rollback and recovery capabilities. In that context, it hardly seems necessary to treat logging as a separable component, but it should be. A transactional component might be designed to work with multiple implementations, some of which do not use logging (for example, no-overwrite schemes such as shadow pages). Perhaps even more interesting, a logging system might be useful outside the context of transactions; it might be used for auditing or provide some sort of backup mechanism. In either case, it should be an application designer's decision whether logging is necessary rather than having it imposed by the database vendor.

Finally, data is sometimes so critical that downtime is unacceptable. Many database systems provide replicated or highly available systems to address this need. Although this functionality is often available as an add-on in today's systems, they have not gone far enough. A developer may wish to use a database's HA (high-availability) configuration, but may use it in conjunction with some other company's HA substrate. If
the application already has a substrate that performs heartbeat protocols (or any other mechanism that notifies the application or system when a component fails), fail-over, and redundant communication channels, then you will want to exclude those components from the database management system and hook into the existing functionality. Monolithic systems do not allow this, whereas a component-based, modular architecture does.

In addition to providing smaller, simpler applications, components with well-defined, clean, exposed interfaces provide for a degree of extensibility that is simply not possible in a monolithic system. For example, consider the basic set of components needed to construct a transactional system: a transaction manager, a lock manager, and a log manager. If these modules are open and extensible, then the developer can build systems that incorporate items that are not managed by the database system into transactions. Consider, for example, a network switch: the state of the configuration database depends on the state of hardware inside the device, and vice versa. If the electrical control over chips and boards can be incorporated into transactions, by allowing the programmer to extend the locking and logging system to communicate with them, then operations such as "power up the backup network interface card" can be made transactional.

Modularity is a powerful tool for managing size and complexity of applications and systems while also enabling the application and data management capabilities to seamlessly interact. Thus, we have proposed an architecture that enables developers to exclude functionality they do not need and include functionality they do need but is not provided by the database vendor.

Configurability

The second property of a flexible data management system is configurability. Whereas modularity is an architectural mechanism, configuration is mostly a runtime mechanism. With a component-based architecture, the build-time configuration is involved in selecting appropriate components. A single collection of components may still run on a range of systems with wildly different capabilities. For example, just because two applications both want transactions and B-trees, this does not mean that both can support a multi-gigabyte in-memory cache. The ability to adapt to radically different circumstances is critical. Configurability refers to how well a system can be matched to its environment and application needs. In this article we discuss configurability with respect to the hardware, the environment in which the application runs (for example, the operating system), the application's software architecture, and the natural data format of the application.

Hardware environments introduce variability in CPU speed, memory size, and persistent storage capabilities. Variability in CPU speed and persistent storage introduces the possibility of trading computation for disk bandwidth. On a fast processor, it may be beneficial to compress data, consuming CPU cycles, in order to save I/O; on a PDA, where CPU cycles are scarce and persistent I/O is fast, compression might not be the right trade-off.

In a world where resource-constrained devices require potentially sophisticated data management, developers must have control over the memory and disk consumption policies of the database. In different environments, applications may need control over the maximum size of in-memory data structures, the maximum size of persistent data, and the space consumed by transactional logs. Policies for consumption of these resources must be set by the application developer, not the end user, since the developer is more likely to have the technical savvy necessary to make the right decisions.

Variability in persistent storage technologies places new demands on the database engine as well. Not only must it work well in the presence of spinning, magnetic storage, but it should also run well on other media (for example, flash) with constraints on behaviors (such as the number of writes to a particular memory location), and it may need to run in the absence of any persistent storage. For example, some applications want to manage data entirely in main memory, with no persistence; some want to manage data with full synchronous transactional guarantees on updates; and some need something in the middle. Each of these policies should be implemented by the same transactional component, but the database should allow the programmer to control whether or not data persists across power-down events and the strictness of any transactional assurances that the system makes to the end user.

Although many embedded systems are now able to use commodity off-the-shelf hardware platforms, many proprietary devices still exist. The ubiquitous data management solution will be portable to these special-purpose hardware devices. It will also be portable to a variety of operating systems; the services available from the operating system on a mobile telephone handset are different from those available on a 64-way multiprocessor with gigabytes of RAM, even if both are running Linux. If the data management system is to run everywhere, then it must rely only on the services common to most operating systems, and it must provide explicit mechanisms to allow portability, through simple interposition libraries or source-code availability.

Even on a single platform, the developer makes architectural choices that affect the database system. For example, a system may be built using: a single thread of control; a collection of cooperating processes, each of which is single-threaded; multiple threads of control in a single process; multiple multithreaded processes; or a strictly event-based architecture. These choices are driven by a combination of the application's requirements, the developer's preferences, the operating system, and the hardware. The database system must accommodate them.

The database must also avoid making decisions about network protocols. Since the database will run in environments where communication takes place over backplanes, as well as environments where it takes place over WANs, the developer should select the appropriate communication infrastructure. A special-purpose telephone switch chassis may include a custom backplane and protocol for fast communication among redundant boards; the database must not prevent the developer from using it.

Up to this point, configurability has revolved around adapting to the hardware and software environment of the application. The last area of configuration that we address revolves around the application's data. Data layout, indexing, and access are critical performance considerations. There are three main design points with respect to data: the physical clustering, the indexing mechanism, and the internal structure of items in the database. Some of these, like the indexing mechanism, really are runtime configuration decisions, whereas others are more about giving the application the ability to make design decisions, rather than having designers forced into decisions because of the database management system.

Database management systems designed for spinning magnetic media expend considerable effort clustering related data together on disk so that seek and rotation times can be amortized by transferring a large amount of data per repositioning event. In general, this clustering is good, as long as the data is clustered according to the correct criteria. In the case of a configurable database system, this means that the developer needs to retain control over primary key selection (as is done in most relational database management systems) and must be able to ignore clustering issues if the persistent medium either does not exist or does not show performance benefits to accessing locations that are close to the last access.

On a related note, the developer must be left the flexibility to select an indexing structure for the primary keys that is appropriate for the workload. Workloads with locality of reference are probably well served by B+ trees; those with huge datasets and truly random access might be better off with hash tables. Perhaps the data is highly dimensional and requires a completely different indexing structure; the extensibility discussed in the previous section should allow a developer to provide an application-specific indexing mechanism and use it with all of the system's other features (for example, locking, transactions). At a minimum, the configurable database should provide a range of alternative indexing structures that support iteration, fast equality searches, and range searches, including searches on partial keys.

Unlike relational engines, the configurable engine should permit the programmer to determine the internal structure of its data items. If the application has a dynamic or evolving schema or must support ad hoc queries, then the internal structure should be one that enables high-level query access such as SQL, XPath, XQuery, LDAP, etc. If, however, the schema is static and the query set is known, selecting an internal structure that maps more directly to the application's internal data structures provides significant performance improvements. For example, if an application's data is inherently nonrelational (for example, containing multivalued attributes or large chunks of unstructured data), then forcing it into a relational organization simply to facilitate SQL access will cost performance in the translation and is unlikely to reap the benefits of the relational store. Similarly, if the application's data was relational, forcing it into a different format (for example, XML, object-oriented, among others) would add overhead for no benefit. The configurable engine must support storing data in the format that is most natural for the application. It is then the programmer's responsibility to select the format that meets the most natural criteria.

New-Style Databases for New-Style Problems

Old-style database systems solve old-style problems; we need new-style databases to solve new-style problems. While the need for conventional database management systems isn't going away, many of today's problems require a configurable database system. Even without a crystal ball, it seems clear that tomorrow's systems will also require a significant degree of configurability. As programmers and engineers, we learn to select the right tool to do a job; selecting a database is no exception. We need to operate in a mode where we recognize that there are options in data management, and we should select the right tool to get the job done as efficiently, robustly, and simply as possible.

References

1. Astrahan, M.M. System R: Relational approach to database management. ACM Trans. Database Systems 1, 2 (1976), 97-137.
2. Bernstein, P. The Asilomar Report on database research. ACM SIGMOD Record 27, 4 (1998); www.sigmod.org/record/issues/9812/asilomar.html.
3. Broussard, F. Worldwide IT asset management software forecast and analysis, 2002-2007. (2004). IDC Doc. #30277; www.idc.com/getdoc.jsp?containerId=30277&pid=35178981.
4. Chaudhuri, S., and Weikum, G. Rethinking database system architecture: Towards a self-tuning RISC-style database system. The VLDB Journal. (2000), 1-10; www.vldb.org/conf/2000/P001.pdf.
5. Codd, E.F. A relational model of data for large shared data banks. Commun. ACM 13, 6 (June 1970): 377-387.
6. Gray, J., and Reuter, A. Transaction Processing: Concepts and Technologies. Morgan Kaufman, San Mateo, CA, 1993, 397-402.
7. Stonebraker, M. The design and implementation of Ingres. ACM Trans. Database Systems 1, 3 (1976), 189-222.
8. Stonebraker, M., and Cetintemel, U. One size fits all: An idea whose time has come and gone. In Proceedings of the 2005 International Conference on Data Engineering (April 2005); http://www.cs.brown.edu/~ugur/fits_all.pdf.

Margo I. Seltzer (margo@eecs.harvard.edu) is the Herchel Smith Professor of Computer Science and a Harvard College Professor in the Division of Engineering and Applied Sciences at Harvard University, Cambridge, MA. She is also a founder and CTO of Sleepycat Software, the makers of Berkeley DB.

A previous version of this article appeared in the April 2005 issue of ACM Queue, Vol. 3, No. 3.

© 2008 ACM 0001-0782/08/0700 $5.00
