Professional Documents
Culture Documents
fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TSE.2016.2553039, IEEE
Transactions on Software Engineering
IEEE TRANSACTIONS ON SOFTWARE ENGINEERING
I NTRODUCTION
T-H. Chen, A. E. Hassan are with the Software Analysis and Intelligence
Lab (SAIL) in the School of Computing at Queens University, Canada.
E-mail: {tsehsun, ahmed}@cs.queensu.ca
W. Shang is with Concordia University, Canada.
E-mail: shang@encs.concordia.ca
Z. Jiang is with York University, Canada.
E-mail: zmjiang@cse.yorku.ca
M. Nasser and P. Flora are with BlackBerry, Canada.
0098-5589 (c) 2016 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TSE.2016.2553039, IEEE
Transactions on Software Engineering
IEEE TRANSACTIONS ON SOFTWARE ENGINEERING
Even though other columns (e.g., address, phone number, and profile picture) were not modified by the code, the
corresponding generated SQL query is:
update user set name=Peter, address=Waterloo,
phone number = 12345, profile pic = binary data
where id=1;
Such redundant data problems may bring significant
performance overheads when, for example, the generated
SQLs are constantly updating binary large objects (e.g.,
profile picture) or non-clustered indexed columns (e.g., assuming phone number is indexed) in a database table [69].
The redundant data problems may also cause a significant
performance impact when the number of columns in a table
is large (e.g., retrieving a large number of unused columns
from the DBMS). Prior studies [50], [58] have shown that
the number of columns in a table can be very large in
real-world systems (e.g., the tables in the OSCAR database
have 30 columns on average [50]), and some systems may
even have tables with more than 500 columns [7]. Thus,
locating redundant data problems is helpful for large-scale
real-world systems.
In fact, developers have shown that by optimizing ORM
configurations and data retrieval, system performance can
increase by as much as 10 folds [15], [63]. However, even
though developers can change ORM code configurations to
resolve different kinds of redundant data problems, due to
the complexity of software systems, developers may not be
able to locate such problems in the code, and thus may not
proactively resolve the problems [15], [40]. Besides, there is
no guarantee that every developer knows the impact of such
problems.
In this paper, we propose an approach for locating
redundant data problems in the code. We implemented
the approach as a framework for detecting redundant data
problems in Java-based ORM frameworks. Our framework
is now being used by our industry partner to locate redundant data problems.
Redundant data or computation is a well-known cause
for performance problems [53], [54], and in this paper, we
focus on detecting database-related redundant data problems. Our approach consists of both static and dynamic
analysis. We first apply static analysis on the source code
to automatically identify database-accessing functions (i.e.,
functions that may access the data in the DBMS). Then, we
use bytecode instrumentation on the system executables to
obtain the code execution traces and the ORM generated
SQL queries. We identify the needed database accesses
by finding which database-accessing functions are called
during the system execution. We identify the requested
database accesses by analyzing the ORM generated SQL
queries. Finally, we discover instances of the redundant data
problems by examining the data access mismatches between
the needed database accesses and the requested database
accesses, within and across transactions. Our hybrid (static
and dynamic analysis) approach can minimize the inaccuracy of applying only data flow and pointer analysis on the
code, and thus can provide developers a more complete
picture of the root cause of the problems under different
workloads.
2)
3)
R ELATED W ORK
0098-5589 (c) 2016 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TSE.2016.2553039, IEEE
Transactions on Software Engineering
IEEE TRANSACTIONS ON SOFTWARE ENGINEERING
BACKGROUND
Background of ORM
0098-5589 (c) 2016 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TSE.2016.2553039, IEEE
Transactions on Software Engineering
IEEE TRANSACTIONS ON SOFTWARE ENGINEERING
User.java
@Entity
@Table ( name = user )
public class User {
@Id
@Column(name=user_id)
private long userId ;
@Column(name=user_name)
private String userName;
... other instance variables
@ManyToOne
@JoinColumn(name=group_id)
private Group group;
ORM
ORM
cache
Database
Fig. 1. An example flow of how JPA translates object manipulation to SQL. Although the syntax and configurations may be different for other ORM
frameworks, the fundamental idea is the same: developers need to specify the mapping between objects and database tables, the relationships
between objects, and the data retrieval configuration (e.g., eager v.s. lazy).
0098-5589 (c) 2016 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TSE.2016.2553039, IEEE
Transactions on Software Engineering
IEEE TRANSACTIONS ON SOFTWARE ENGINEERING
5
SQLs
Studied
Systems
Source
Code
Exercising workloads
Static
analysis
List of
database-accessing
functions
Analyzing SQL
Code
Execution
Traces
Combining info
Requested
Database
Accesses
Assessing
Transactions Performance
Performance
With
impact
Comparing
impact
Redundant
assessment
Data
Needed
Database
Accesses
Fig. 2. An overview of our approach for finding and evaluating redundant data problems.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TSE.2016.2553039, IEEE
Transactions on Software Engineering
IEEE TRANSACTIONS ON SOFTWARE ENGINEERING
TABLE 1
Statistics of the studied systems.
System
Pet Clinic
Broadleaf 3.0
ES
Total lines
of code (K)
No. of
files
Max. No. of
columns
3.3K
206K
> 300K
51
1,795
> 3,000
6
28
> 50
performance optimization efforts. Since there may be different types of redundant data problems and each type may
need to be assessed differently, we discuss our assessment
approach in detail in Section 6.3, after discussing the types of
redundant data problems that we discovered in Section 6.2.
E XPERIMENTAL S ETUP
Experiments
TABLE 2
Overview of the redundant data problems that we discovered in our
exercised workloads. Trans. column shows where the redundant data
problem is discovered (i.e., within a transaction or across transactions).
Types
Trans.
Description
Update all
Select all
Within
Within
Excessive data
Within
Per-trans cache
Across
E VALUATION OF O UR A PPROACH
Framework Implementation
0098-5589 (c) 2016 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TSE.2016.2553039, IEEE
Transactions on Software Engineering
IEEE TRANSACTIONS ON SOFTWARE ENGINEERING
TABLE 3
Prevalence of the discovered redundant data problems in each test suite. The detail of ES is not shown due to NDA.
System
Test Case
Description
Pet Clinic
Broadleaf
ES
Phone Controller
Payment Info
Customer Addr.
Customer
Offer
Multiple Features
Total No.
of Trans.
Total No.
of Trans. with
Redundant Data
60
60 (100%)
6 (10%)
60 (100%)
50 (83%)
7 (12%)
807
813
611
604
419
> 1000
805 (99%)
611 (75%)
609 (99%)
602 (99%)
201 (48%)
> 30%
4 (0.5%)
10 (1.6%)
7 (1.1%)
4 (0.7%)
19 (9%)
3%
805 (100%)
611 (100%)
607 (99%)
602 (100%)
19 (9%)
100%
203 (25%)
7 (1.1%)
7 (1.1%)
3 (0.5%)
17 (9%)
0%
202 (25%)
200 (25%)
203 (33%)
200 (33%)
201 (100%)
23%
that we study here is by no means complete. In the following subsections, we first describe the type of redundant
data problems that we discovered, then we discuss their
prevalence in our studied systems.
6.2.1
0098-5589 (c) 2016 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TSE.2016.2553039, IEEE
Transactions on Software Engineering
IEEE TRANSACTIONS ON SOFTWARE ENGINEERING
The first select is needed because the user data was previously updated by another SQL query (the same primary
key in the where clause). The second select is not needed
as the data is not changed. Most ORM frameworks provide
cache mechanisms to reuse fetched data and to minimize
database accesses (Section 7), but the cache configuration
is never automatically optimized for different application
systems [44]. Thus, some ORM frameworks even turn the
cache off by default. Developers are aware of the advantages
of having a global cache shared among transactions [64],
but they may not proactively leverage the benefit of such a
cache. We have seen cases in real-world large-scale systems
where this redundant data problem causes the exact same
SQL query to be executed millions of times in a short
period of time, even though the retrieved entity objects
are not modified. The cache configuration may be slightly
different for different ORM frameworks, but our approach
is able to give a detailed view on the overall data read and
write. Thus, our approach can assist developers with cache
optimization.
Note that, although some developers are aware of the
above-mentioned redundant data problems, it is not a
common knowledge. Moreover, some developers may still
forget to resolve the problem, as a redundant data problem
may become more severe as a system ages.
TABLE 4
Total number of SQLs and the number of duplicated selects in each
test suite.
System
Test Case
Description
Total No.
No. of
of SQL queries Duplicate Selects
32,921
29,882 (91%)
Broadleaf
Phone Controller
Payment Info
Customer Addr.
Customer
Offer
1,771
1,591
2,817
1,349
1,052
431 (24%)
11 (0.7%)
21 (0.7%)
22 (1.6%)
41 (3.9%)
ES
Multiple Features
>> 10,000
> 10%
0098-5589 (c) 2016 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TSE.2016.2553039, IEEE
Transactions on Software Engineering
IEEE TRANSACTIONS ON SOFTWARE ENGINEERING
trivial
small
effect size =
medium
large
if Cohens d 0.2
if 0.2 < Cohens d 0.5
if 0.5 < Cohens d 0.8
if 0.8 < Cohens d
0098-5589 (c) 2016 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TSE.2016.2553039, IEEE
Transactions on Software Engineering
IEEE TRANSACTIONS ON SOFTWARE ENGINEERING
10
TABLE 5
Performance impact study by resolving the redundant data problems in each test suite. Response time is measured in seconds at the client side.
We mark the results in bold if resolving the redundant data problems has a statistically significant improvement. For response time differences,
large/medium/small/trivial effect sizes are marked with L, M, S, and T, respectively.
System
Test Case
Base
Update All
p-value
resp. time
14.00.69
18.72.6
29.71.17
13.70.63
22.90.95
13.40.60
17.30.65
28.40.58
13.00.49
21.31.05
ES
0%
Multiple Features
(-4%)M
(-7%)M
(-4%)M
(-5%)M
(-7%)L
Select All
p-value
resp. time
Excessive Data
p-value
resp. time
Per-trans. Cache
p-value
resp. time
0.85
0.007
0.04
<<0.001
<<0.001
<<0.001
13.90.65 (0%)T
18.10.88 (-3%)S
28.70.45 (-3%)M
13.00.57 (-5%)M
22.31.08 (-3%)S
0.44
0.37
0.001
<<0.001
0.13
>0.05
> 30%L
<<0.001
Select All. The select all problem causes a statistically significant performance impact in Pet Clinic, ES, and two test
suites in Broadleaf (331% improvement) with varying effect
sizes. Due to the nature of the Broadleaf test suites, some
columns have null values, which reduce the overhead of
data transmission. Thus, the effect of the select all problem
is not as significant as the update all problem. In addition
to what we discuss in Section 6.2, select all may also cause
a higher performance impact in column-oriented DBMSs.
When selecting many different columns from a columnoriented DBMS, the DBMS engine needs to seek for the
columns in different data storage pools, which would significantly increase the time needed to retrieve data from the
DBMS.
Excessive Data. We find that the excessive data problem has
a high performance impact in Pet Clinic (92% performance
improvement), but only 26% improvement in Broadleaf
and 5% in ES with mostly non-trivial effect sizes. Since we
know that the performance impact of the redundant data
problem is highly dependent on the exercised workloads,
we are interested in knowing the reasons that cause the large
differences. After a manual investigation, we find that the
excessively selected table in Pet Clinic has a @OneToMany
relationship. Namely, the transaction is selecting multiple
associated excessive objects from the DBMS. On the other
hand, most excessive data in Broadleaf has a @ManyToOne or
@OneToOne relationship. Nevertheless, excessively retrieving single associated object (e.g., excessively retrieving the
child object in a @ManyToOne or @OneToOne relationship)
may still cause significant performance problems [6]. For
example, if the eagerly retrieved object contains large data
(e.g., binary data), or the object has a deep inheritance
relationship (e.g., the eagerly retrieved object also eagerly
retrieves many other associated objects), the performance
would also be impacted significantly.
Per-Transaction Cache. The per transaction cache problem
has a statistically significant performance impact in 4 out
of 5 test suites in Broadleaf with non-trivial effect sizes.
We also see a large performance improvement in Pet Clinic,
where resolving the per-transaction cache problem improves
the performance by 88%. Resolving the per-transaction cache
problem also improves the ES performance by 10% (with
large effect sizes).
The performance impact of the per-transaction cache may
be large if, for example, some frequently accessed read-only
entity objects are stored in the DBMS and are not shared
among transactions [69]. These objects will be retrieved
13.80.47 (-1%)S
18.31.0 (-2%)T
29.00.68 (-2%)S
12.90.47 (-6%)M
23.41.27 (+2%)S
0.28
0.58
0.05
<<0.001
0.17
13.00.56 (-7%)L
18.00.74 (-4%)S
28.80.54 (-3%)S
13.30.53 (-3%)S
21.90.87 (-4%)M
<<0.001
0.33
0.008
0.03
0.002
> 30%L
<<0.001
TABLE 6
Existence of the studied redundant data problems in the surveyed
ORM frameworks (under default configurations).
Lang.
ORM
Framework
Update
all
Select
all
Exce.
Data
Per-trans.
Cache
Java
Hibernate
Yes
Yes
Yes
Yes
Java
C#
C#
Python
Ruby
EclipseLink
NHibernate
Entity Framework
Django
ActiveRecord
No
Yes
Yes
Yes
No
Yes
Yes
Yes
Yes
Yes
Yes
Yes
Yes
Yes
Yes
Yes
Yes
Yes
Yes
Yes
0098-5589 (c) 2016 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TSE.2016.2553039, IEEE
Transactions on Software Engineering
IEEE TRANSACTIONS ON SOFTWARE ENGINEERING
T HREATS TO VALIDITY
11
this problem by picking systems with various sizes, and include both open source and industrial systems in our study.
Nevertheless, the most important part of our approach is
that it can be adapted to find redundant data problems in
other systems using various ORM frameworks.
8.2
9
8.1
External Validity
Construct Validity
C ONCLUSION
0098-5589 (c) 2016 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TSE.2016.2553039, IEEE
Transactions on Software Engineering
IEEE TRANSACTIONS ON SOFTWARE ENGINEERING
simplifies software development. Thus, developers can focus on business logic instead of worrying about non-trivial
database access details. However, ORM mappings introduce
redundant data problems (e.g., the needed data in the code
does not match with the requested data by the ORM framework), which may cause serious performance problems.
In this paper, we proposed an automated approach to
locate the redundant data problems in the code. We also
proposed an automated approach for helping developers
prioritize the efforts on fixing the redundant data problems.
We conducted a case study on two open source and one
enterprise system to evaluate our approaches. We found
that, depending on the workflow, all the redundant data
problems that are discussed in the paper have statistically
significant performance overheads, and developers are concerned about the impacts of these redundant data problems.
Developers do not need to manually locate the redundant
data problems in thousands of lines of code, and can leverage our approach to automatically locate and prioritize the
effort to fix these redundant data problems.
ACKNOWLEDGMENTS
We are grateful to BlackBerry for providing access to the
enterprise system used in our case study. The finding and
opinions expressed in this paper are those of the authors and
do not necessarily represent or reflect those of BlackBerry
and/or its subsidiaries and affiliation. Our results do not in
any way reflect the quality of BlackBerrys products.
R EFERENCES
[1]
Django
objects
values
select
only
some
fields.
http://stackoverflow.com/questions/7071352/
django-objects-values-select-only-some-fields.
Last accessed
March 16, 2016.
[2] FoundationDB. http://community.foundationdb.com/. Last accessed May 16, 2016.
[3] Hibernate : dynamic-update dynamic-insert - performance
effects.
http://stackoverflow.com/questions/
3404630/hibernate-dynamic-update-dynamic-insert\
-performance-effects?lq=1. Last accessed March 16, 2016.
[4] Hibernate
criteria
query
to
get
specific
columns.
http://stackoverflow.com/questions/11626761/
hibernate-criteria-query-to-get-specific-columns. Last accessed
March 16, 2016.
[5] JPA2.0/hibernate: Why JPA fires query to update all
columns value even some states of managed beans
are
changed?
http://stackoverflow.com/questions/
15760934/jpa2-0-hibernate-why-jpa-fires-query-to-update\
-all-columns-value-even-some-stat. Last accessed March 16,
2016.
[6] Making a onetoone-relation lazy. http://stackoverflow.com/
questions/1444227/making-a-onetoone-relation-lazy. Last accessed March 16, 2016.
[7] mysql
how
many
columns
is
too
many?
http://stackoverflow.com/questions/3184478/
how-many-columns-is-too-many-columns. Last accessed March
16, 2016.
[8] NHibernate update on single property updates all
properties in sql.
http://stackoverflow.com/questions/
813240/nhibernate-update-on-single-property-updates-all\
-properties-in-sql. Last accessed March 16, 2016.
[9] Why in JPA Hibernate update query ; all attributes
get update in sql.
http://stackoverflow.com/questions/
10315377/why-in-jpa-hibernate-update-query-all-attributes\
-get-update-in-sql. Last accessed March 16, 2016.
[10] D. Barry and T. Stanienda. Solving the java object storage problem.
Computer, 31(11):3340, Nov. 1998.
[11] R. Binder. Testing Object-oriented Systems: Models, Patterns, and
Tools. Addison-Wesley, 2000.
12
0098-5589 (c) 2016 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TSE.2016.2553039, IEEE
Transactions on Software Engineering
IEEE TRANSACTIONS ON SOFTWARE ENGINEERING
13
[55]
[56]
[57]
[58]
[59]
[60]
[61]
[62]
[63]
[64]
[65]
[66]
[67]
[68]
[69]
0098-5589 (c) 2016 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.