
Introducing kewpie – a feedback-

based query generator


patrick crews
gleebix@gmail.com
wc220.com
IRC: pcrews

These slides released under the Creative Commons Attribution-Noncommercial-Share Alike 3.0 License
What we'll cover


Not just about the new tool
− Background on the problems related to the
functional testing of database systems

We'll survey and assess the state of the testing tool
landscape

Applications / strengths / limitations
Why the history lesson?


Testing is a difficult task, particularly for
complex systems like database servers

This task presents its own interesting set of
obstacles to overcome.

A little bit of background is necessary to
understand the reasoning behind this
approach.
 “Not ignorance, but ignorance of ignorance is
the death of knowledge”
− Alfred North Whitehead
Where do you get these ideas?


Based on research provided by Microsoft's SQL
Server team

One of the only sources for material on this topic
− Their work has provided invaluable insight into
long-term results of strategies and inspiration for
newer testing techniques
 Hands-on experience working with MySQL and
Drizzle

Almost 4 years of blood, sweat, and tears ; )
Testing Databases is Crazy Hard!
It's a big task


Table composition (data types and combinations
thereof)

Table population (size and data distribution)

Query space – SQL is expressive!

Table access methods (optimizations, etc)
− index_merge/intersection = the unicorn of optimizer
testing ; )

Effects of various switches (materialization,
semijoin, etc)
It's a really big task


Essentially infinite input space + ever-growing
feature set =

Exhaustive testing – 'not gonna happen'
− Need to be smart about what we do test
− Need to be ruthless about what tests we accept as
good
 Maintenance is costly – can't waste time on useless tests
Some additional things


Not easy to unit-test

Logical separations well-understood, testing them not
so easy.
 Semantics of the test are as suspect as the code
− Hack up a parse tree for a subquery-heavy bit of SQL?

Time to benefit ratio = not so good / lots of effort
− Our unit-testing GSOC student ran away after the summer and hasn't
come near Drizzle since ; )
 End-to-end testing (SQL queries) most
effective / productive
Focus on functional testing


Other types of tests are important, but having
a really fast server that delivers incorrect
results doesn't matter

One could also argue that such tests evolve
from a set of solid queries that exercise the
server code
 We concern ourselves with useful query / test
case generation
Useful queries and tests?


A test with 1 million SELECT 1's technically
does something, but nothing any user would
likely ever care about.

Additionally, the mysql test suite is filled with
random tests where devs thought they were
doing something, but there is no definitive
proof.
− one test with 10k rows of data and 2 simple
selects – why?
 It 'seemed' like it was doing something?
− Devolving into superstition at this point
The evolution of testing tools
Understanding history


Let's look at the various functional tools
available

Understanding their strengths and limitations helps
us to understand how subsequent tools came to be

Hand-crafted tests
 Random / stochastic testing tools


feedback-based random query generation
− aka genetic algorithms
 mad science...mwa ha ha! ; )
hand-crafted tests
e.g.
drizzle-test-run / mysql-test-run
In the beginning...


How almost all testing starts.

Quick

Easy
 VERY good for targeted testing

Easily verified results / limited domain

LENGTH(), SUBSTR()

small, limited functions
 Can apply equivalence class partitioning

− out of lower bounds, good input, out of upper bounds


Still singing the praises


It is how most systems test
− Testing based on this strategy helped make MySQL
into a solid and widely-used product

Postgres' test suite is based on similar tests
− Drizzle still uses it for a significant portion of its
own testing
 Significant time and effort have been put into most
tests

Waste not, want not
Hand-crafted tests


Can be anything
− we generally view DTR .test files as a case
− mysql_protocol.prototest uses python scripting
 Generally mean a highly targeted test case
that was written by a human and that will
likely require maintenance and extension by
one as well
Example test (slave plugin)

--disable_warnings
DROP TABLE IF EXISTS t1;
--enable_warnings
--echo Populating master server
CREATE TABLE t1 (a int not null auto_increment, primary key(a));
INSERT INTO t1 VALUES (),(),();
SELECT * FROM t1;
….
--echo Connecting to slave...
connect (slave_con,127.0.0.1,root,,test, $BOT0_S1);
echo Using connection slave_con...;
connection slave_con;
--sleep 3
--echo Checking slave contents...
--source include/wait_for_slave_plugin_to_sync.inc
SHOW CREATE TABLE t1;
SELECT * FROM t1;
Example result
DROP TABLE IF EXISTS t1;
Populating master server
CREATE TABLE t1 (a int not null auto_increment, primary key(a));
INSERT INTO t1 VALUES (),(),();
SELECT * FROM t1;
a
1
2
3
Connecting to slave...
Using connection slave_con...
Checking slave contents...
SHOW CREATE TABLE t1;
Table Create Table
t1 CREATE TABLE `t1` (
`a` INT NOT NULL AUTO_INCREMENT,
PRIMARY KEY (`a`)
) ENGINE=InnoDB COLLATE = utf8_general_ci
SELECT * FROM t1;
a
1
2
3
Switching to default connection
DROP TABLE t1;
So, what's the issue?


Where to begin?
− Scalability / sustainability
 Human effort / attention required to just maintain
existing tests can be considerable

Effort to write new tests, particularly for complicated
features

Microsoft's research has shown an average of 0.5 hours
for a good, hand-crafted test
− Testing complicated features == EXPENSIVE!
 Also shown that test development time can far exceed
code development time
Bad strategery, cont'd


Lack of coverage
− Due to the nature of human development cycles,
typically simpler cases are created first. As a
result, more complex bugs can be missed until
much later in the development cycle (if at all)
− Microsoft has admitted that bugs were found long
after the defective feature had been rolled out
due to the non-standard circumstances required to
trigger them
Lack of coverage, cont'd


More complex queries, such as the heavy use
of subqueries for optimizer tests, literally
can't be written by hand

The effort required to create valid complex
queries isn't worth it

Validation is also quite problematic

Will discuss solutions to this problem in a bit...
− If anyone is actually good at these tasks, I am
scared of them

Break out the Turing test!
Crazy-complex queries!

SELECT STRAIGHT_JOIN table1 . `col_text_key` AS field1 , table1 . `col_char_1024_key` AS field2 , table1 . `col_int_key` AS field3 , table1 . `col_char_10_key` AS field4 , table1 . `col_text_not_null` AS field5
FROM ( ( SELECT SUBQUERY1_t2 . * FROM AA AS SUBQUERY1_t1 WHERE SUBQUERY1_t1 . `col_char_1024_key` <= SUBQUERY1_t1 . `col_text_key` ) AS table1 STRAIGHT_JOIN AA AS table2 ON (table2 . `col_int_not_null_key` = table1 . `col_bigint` ) )
WHERE ( table2 . `col_char_1024_not_null` <= ( SELECT MAX( SUBQUERY2_t1 . `col_char_1024` ) AS SUBQUERY2_field1 FROM BB AS SUBQUERY2_t1 ) ) OR table1 . `col_text_key` IS NULL
GROUP BY field1, field2, field3, field4, field5 ORDER BY field1 ASC

EXPLAIN SELECT table1 . `col_char_10_key` AS field1
FROM ( ( SELECT SUBQUERY1_t1 . * FROM CC AS SUBQUERY1_t1 ) AS table1 INNER JOIN ( ( ( SELECT DISTINCT SUBQUERY2_t2 . * FROM AA AS SUBQUERY2_t1 ) AS table2 RIGHT OUTER JOIN A AS table3 ON (table3 . `col_int_not_null_key` = table2 . `col_int_not_null_key` ) ) ) ON (table3 . `col_text_not_null_key` = table2 . `col_char_1024_key` ) )
WHERE ( NOT EXISTS ( ( SELECT SUBQUERY3_t2 . `col_int` AS SUBQUERY3_field1 FROM ( AA AS SUBQUERY3_t1 STRAIGHT_JOIN AA AS SUBQUERY3_t2 ON (SUBQUERY3_t2 . `col_int_key` = SUBQUERY3_t1 . `col_int_not_null_key` ) ) ) ) ) AND table1 . `pk` > 117 AND table1 . `pk` < ( 117 + 135 )
GROUP BY field1 ORDER BY field1 , field1 LIMIT 10 OFFSET 2
stochastic-based testing
e.g.
random query generator
Technology to the rescue!


1998 (yes, 1998!) - Microsoft publishes a paper
outlining the RAGS system

Brute force, intelligently applied
− Automated tool for query generation
 Allows rapid generation of complex test queries
− Microsoft research refers to a 1 million-fold increase in
query volume

Recognizes that query generation and execution are
essentially mechanical tasks

Leave the human free to use creativity and attention
where it is effective
RAGS


randomly generated queries
− General rules for query construction
 via stochastic parse tree

Throw it at the server

Validation through comparison
− Different DBMS's
− Same software w/ different settings
 Crash detection

~50% of generated queries executed / returned results
MySQL vs. the randgen


2007-ish, the random query generator (aka
randgen) is unleashed on the MySQL codebase

Based on Microsoft's RAGS research
− Put the hurting on the Falcon storage engine
− Also part of why we are just now seeing 6.0
optimizer features being reintroduced into MySQL
>: )

Admittedly a lot of edge cases
− Broken is broken

Lots of edge case bugs are worrisome as well
 What the hell is going on in the code?
Sample randgen grammar

query:
SELECT * FROM _table WHERE int_field comparison_operator _digit ;

comparison_operator:
> | >= | < | <= | > | < | > | < | = | != ;

int_field:
col_int | col_int_key | col_int_not_null | col_int_not_null_key | pk ;
Applications


Good for providing an initial baseline
− Determine if the code is solid enough for a human
to devote craftiness to breaking it

Frees QA devs' time for better things
 think up more challenges

work on fine-grained testing (hand-crafted tests)

It is vital to remember that QA is a creative (not
mechanical) task. Outsmarting buggy code requires
time
 “See what shakes out”
Applications, cont'd


Good for testing some things, not others
− Covers a lot of ground, but not so easy to express
certain things in a stochastic manner

Optimizer validation = great!
− Transaction log = great!
− Testing complex scenarios = not so great

Drizzledump migration
Now for the bad stuff


There are tradeoffs – can do some things at
the expense of not doing other tasks

We trade precision for brute force
− The wrong tool can make a seemingly easy task
much, much more difficult ; )
 Like picking up a dime with a pair of gloves

Again, testing Drizzledump migration
− pcrews.egg_on_face=True
Tests can still be expensive


We can cover a lot of ground for our efforts,
but it still takes development and
maintenance time

Creating verifiable tests
− Tuning the tests
 To hit desired code
 To generate valid queries

 Often cyclical


Maintenance
 How hard to update or change?
Expensive tests


Optimizer grammars took ~4 months to
produce

It was a first effort, but it shows that becoming
familiar enough with a feature to test it is not trivial

Outer join grammars took ~2 months

Need to figure in feature complexity, value,
etc
Additional costs


Development and tuning
− How easily expanded are the tools as we discover
new ideas / needs?

Certain things hard to express
 can't always change the server state as we want

Changing the tests can also be expensive

Tradeoff between tuning and robustness
How complex could a test be?

join:
{ $stack->push() }
table_or_join
{ $stack->set("left",$stack->get("result")); }
left_right outer JOIN table_or_join
ON
join_condition ;

join_condition:
int_condition | char_condition ;

int_condition:
{ my $left = $stack->get("left"); my %s=map{$_=>1} @$left; my @r=(keys %s); my $table_string = $prng->arrayElement(\@r); my @table_array = split(/AS/, $table_string); $table_array[1] } . int_indexed =
{ my $right = $stack->get("result"); my %s=map{$_=>1} @$right; my @r=(keys %s); my $table_string = $prng->arrayElement(\@r); my @table_array = split(/AS/, $table_string); $table_array[1] } . int_indexed
{ my $left = $stack->get("left"); my $right = $stack->get("result"); my @n = (); push(@n,@$right); push(@n,@$left); $stack->pop(\@n); return undef } |

{ my $left = $stack->get("left"); my %s=map{$_=>1} @$left; my @r=(keys %s); my $table_string = $prng->arrayElement(\@r); my @table_array = split(/AS/, $table_string); $table_array[1] } . int_indexed =
{ my $right = $stack->get("result"); my %s=map{$_=>1} @$right; my @r=(keys %s); my $table_string = $prng->arrayElement(\@r); my @table_array = split(/AS/, $table_string); $table_array[1] } . int_field_name
{ my $left = $stack->get("left"); my $right = $stack->get("result"); my @n = (); push(@n,@$right); push(@n,@$left); $stack->pop(\@n); return undef }
Test complexity, cont'd

int_field_name:
`pk` | `col_int_key` | `col_int` |
`col_bigint` | `col_bigint_key` |
`col_int_not_null` | `col_int_not_null_key` ;

char_field_name:
`col_char_10` | `col_char_10_key` | `col_text_not_null` | `col_text_not_null_key` |
`col_text_key` | `col_text` | `col_char_10_not_null_key` | `col_char_10_not_null` |
`col_char_1024` | `col_char_1024_key` | `col_char_1024_not_null` |
`col_char_1024_not_null_key` ;

int_indexed:
`pk` | `col_int_key` | `col_bigint_key` | `col_int_not_null_key` ;

char_indexed:
`col_char_1024_key` | `col_char_1024_not_null_key` |
`col_char_10_key` | `col_char_10_not_null_key` ;
Tuning vs. Robustness
# 2011-03-22T20:45:24 Rows returned:
$VAR1 = {
' 0' => 148,
' 1' => 8,
' 2' => 1,
' 3' => 1,
' 4' => 1,
' -1' => 76,
' 10' => 2,
'>10' => 2,
'>100' => 1
};
# 2011-03-22T20:45:24 Rows affected:
$VAR1 = undef;
# 2011-03-22T20:45:24 Explain items:
$VAR1 = undef;
# 2011-03-22T20:45:24 Errors:
$VAR1 = {
'(no error)' => 173,
'Unknown column \'%s\' in \'IN/ALL/ANY subquery\'' => 12,
'Unknown column \'%s\' in \'field list\'' => 37,
'Unknown column \'%s\' in \'having clause\'' => 9,
'Unknown column \'%s\' in \'where clause\'' => 1,
'Unknown table \'%s\'' => 18
};
Valid queries


Was considered a large problem at Microsoft

Have run into similar issues with the randgen

Difficult to express more complex queries /
sets of queries while keeping them valid and
worthwhile
− Makes it harder to hit difficult / rare code paths or
combinations of them
Wasteful / not reusable


Every time we run the randgen, we generate
the same invalid queries

Good thing – every run with a given seed = same
data and queries produced

Repeatability is a mantra of QA!
− Bad thing – we waste cycles on queries that don't
make it deep into the system

No way to organize queries so we can filter
them according to criteria

At least not yet, randgen devs are a crafty lot!
feedback-based query generation
e.g.
kewpie
Microsoft leads the way again


To overcome the limitations of purely
stochastic systems, adopted a genetic-based
approach

Generate / execute / evaluate / mutate

Uses a variety of feedback from the system
under test to determine 'fitness' of a query
− Keep it?

Mutate it further?
Genetic-based testing


Progressive building of valid queries
− SELECT col1 FROM table1;
− SELECT col1 FROM table1 WHERE col2 < 'value'
− SELECT col1 FROM table1 WHERE col2 < 'value'
AND...
 We end up with a set of queries that have
some marked effect on the database
Organizing queries


MS uses a data warehouse of these test queries
− Provides a pool for all new testing efforts
 Have a new measure of 'interesting'?
− Pull some queries and put them through the system!

Easily organized / sorted / manipulated
 Provides a set of well-cataloged building blocks for
future tests
kewpie...finally ; )


Drizzle's efforts at creating this technology

Our testing experiences have been in-line with
Microsoft's and we recognize similar needs

kewpie? = query probulator
− Futurama for the win!
The probulator!
kewpie


Still very early in development
− Sorry, it won't make your database webscale
overnight

Hire a marketing department for that ; )
 Idea is to teach something how to create
queries once and then provide a means of
directing the query generation

Use of feedback

evaluation functions
− Use of specific mutation patterns
 Favoring some / probability tweaks
Evaluation functions


Check the effects of our query(ies) on the
database

code coverage
− gdb output (Igor delta debugger project)
− EXPLAIN plans

changes in select variables

log output
− custom code instrumentation

The possibilities are endless!
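
As a concrete (if hypothetical) illustration, an EXPLAIN-based evaluation function could be as small as the sketch below – the function name and the pass/fail rule are assumptions for illustration, not kewpie's actual evaluator API:

# Hypothetical sketch of an EXPLAIN-based evaluation function, written against
# a plain DB-API cursor; not kewpie's real interface.
def explain_evaluator(cursor, query, target='index_merge'):
    """Treat a query as 'interesting' if its plan mentions the access method we want."""
    cursor.execute("EXPLAIN " + query)
    plan_rows = cursor.fetchall()
    # keep the query if any row of the plan mentions the targeted access method
    return any(target in str(row) for row in plan_rows)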
kewpie, cont'd


Written in Python

Currently tightly integrated with dbqp.py,
Drizzle's experimental test runner

Likely to become a more separate tool over time
− Expedience and all of that

Originally based on SQLAlchemy

They try to help you succeed a bit too much for
nefarious testing purposes ; )
Design ideas


First and foremost a query generator
− Create good, effective, well-cataloged queries
− Building blocks for more complex tests
 Stress tests

Performance
 Durability


etc
Design ideas, cont'd


We want to have a 'query' be:
− easily manipulated
− easily analyzed / broken down
 Provide a robust set of functions for working
with query objects

addColumn(type=None, aggregate=False...)

addTable(name=None,rowCount=None...)

Lots of knobs for tuning things
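
A rough, illustrative sketch of that kind of interface follows. The addColumn()/addTable() signatures come from the slide above; the toy SCHEMA, toSQL() and the method bodies are assumptions, not kewpie's actual implementation:

# Illustrative query object sketch (assumed names/bodies, not kewpie's code)
import random

SCHEMA = {'t1': ['pk', 'col_int', 'col_char_10'],
          't2': ['pk', 'col_int_key', 'col_text']}

class Query(object):
    def __init__(self):
        self.tables = []    # tables in the FROM clause
        self.columns = []   # items in the SELECT list
        self.wheres = []    # WHERE conditions

    def addTable(self, name=None, rowCount=None):
        # pick a table from the test bed if the caller doesn't name one
        # (rowCount is accepted only to mirror the slide; unused in this sketch)
        self.tables.append(name or random.choice(list(SCHEMA)))

    def addColumn(self, type=None, aggregate=False):
        # draw only from tables already in the query, so the SQL stays valid
        table = random.choice(self.tables)
        column = "%s.%s" % (table, random.choice(SCHEMA[table]))
        self.columns.append("MAX( %s )" % column if aggregate else column)

    def toSQL(self):
        sql = "SELECT %s FROM %s" % (", ".join(self.columns), ", ".join(self.tables))
        if self.wheres:
            sql += " WHERE " + " AND ".join(self.wheres)
        return sql

A test could then call q.addTable() and q.addColumn(aggregate=True) a configured number of times and hand q.toSQL() to the server.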
What can it do?


Not a lot quite yet...: /
− Generation of SELECT lists, certain JOINs and
WHERE conditions

Still a lot to do...was a bit distracted:
 We had this GA release thing we were working on...
Structure


As with all dbqp 'modes', we have a custom
test executor and test manager

− testManager
 what does a testCase look like / how to package the
relevant data for execution
 manages testCases for the testExecutor
− testExecutor
 Set up for the test
 Execute it
 Evaluate the results
Structure


query_manager:
− populates and manipulates queries
 add_tables()
− add a given number of tables to the query object from what is
available in the test bed

add_columns()
− add a column from the tables used in the query
 add_where()
− add a where clause using an available column from the tables
used in the query
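
A loose sketch of how add_where() might behave, assuming the query is held as a simple dict and the test bed as a table-to-columns map (illustrative only; add_tables() and add_columns() would follow the same pattern, and the real query_manager in dbqp/kewpie differs in detail):

# Assumed names and data layout, for illustration only
import random

TEST_BED = {'t1': ['pk', 'col_int', 'col_char_10'],
            't2': ['pk', 'col_int_key', 'col_text']}

def add_where(query):
    # candidate columns come only from tables already used in the query,
    # which is what keeps the generated condition valid
    table = random.choice(query['tables'])
    column = random.choice(TEST_BED[table])
    query['wheres'].append("%s.%s > %d" % (table, column, random.randint(0, 100)))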
Structure


query generation = all about the tables
− we center everything on this as it determines what
columns are available for valid queries

generating invalid SQL will be necessary and useful
at some point, but it is entirely too easy to pick a
column not used in the query

The randgen requires you to pick a bit blindly
in terms of column / table combinations
− means invalid queries (boo, hiss!)
Structure


query_evaluator
− Runs the various bits of evaluator code
− Currently very primitive
 Only have row_count evaluator

Can add other evaluations as needed
 Will eventually need proper fitness functions to
determine which queries will 'live'
− Not all evaluators are as simple as pass/fail.
− optimizer table access methods might require building blocks
that don't hit the targeted optimization initially
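
A sketch of the row_count evaluator described above, again written against a plain DB-API cursor; the exact pass/fail rule (no error, at least one row) is an assumption for illustration:

# Assumed evaluator shape, not the actual kewpie code
def row_count_evaluator(cursor, query, min_rows=1):
    try:
        cursor.execute(query)
        rows = cursor.fetchall()
    except Exception:
        return False                # the query errored out: not a keeper
    return len(rows) >= min_rows    # keep queries that actually return data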
test structure


We prime the system via a python cnf file
− Determines initial query population and their
structure

All generated on-the-fly for now
 Eventually will be able to pull from a database of test
queries, files, etc
− Determines how we mutate a query
 what are we allowed to do to change things
− Designed to help guide generation in a desired fashion
 Large, valid JOIN operations for example
configuration file
[test_info]
comment = basic test of kewpie seeding
test_schema = test
init_query_count = 5
init_table_count = 2
init_column_count = 4
init_where_count = 0

# limits for various values


max_table_count = 10
min_column_count = 1
max_column_count = 25
max_where_count = 10
max_mutate_count = 10

[mutators]
add_table = 2
add_column = 4
add_where = 3

[test_servers]
servers = [[--innodb.replication-log]]

[evaluators]
row_count = True
explain_output = False
test execution


We create an initial set of queries

We then execute each query

If it passes evaluation, we then create a copy
− The original good query can serve as a seed for
further mutations
− The copy is mutated and executed
 We use max_mutate_count to limit query
lifespan (no endless runs)
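
In rough Python, the loop described above might look like the sketch below; the helper functions are stand-ins (the real tool talks to a live server through dbqp and applies the configured mutators), so only the control flow is meant to match the slides:

# Minimal sketch of the execute / evaluate / mutate cycle (assumed helpers)
import random

MAX_MUTATE_COUNT = 10                                  # max_mutate_count from the cnf
MUTATORS = ['add_table', 'add_column', 'add_where']    # mirrors the [mutators] section

def execute(query):
    """Stand-in for running the query and collecting feedback from the server."""
    return {'row_count': random.randint(-1, 5)}        # -1 == error

def evaluate(result):
    """Stand-in row_count evaluation: keep queries that ran and returned rows."""
    return result['row_count'] > 0

def mutate(query):
    """Return a mutated copy of the query; the original survives as a seed."""
    return query + " /* mutated via %s */" % random.choice(MUTATORS)

def run(seed_queries):
    kept = []
    for seed in seed_queries:
        query = seed
        for _ in range(MAX_MUTATE_COUNT):              # limit query lifespan
            result = execute(query)
            if not evaluate(result):
                break                                  # failed evaluation: this line dies out
            kept.append(query)                         # good query kept as a building block
            query = mutate(query)                      # a copy is mutated and tried next
    return kept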
Next steps


How long do you have to listen? ; )
− database storage and retrieval of queries
− more fine grained control over query generation
− more extensive / complex query generation via
query mixing
 subqueries

unions
− trimming the query pool
Next steps...still


More evaluation code
− gcov, gdb, EXPLAIN...
 Fitness functions
 The list goes on
− As mentioned earlier, the test domain is
essentially infinite
− code will evolve to solve problems
Demo Time!
Summary
Testing Challenge
drizzle-test-run/mysql-test-run
random query generator
kewpie
References
 Microsoft

RAGS –“Massive Stochastic Testing of SQL”
http://research.microsoft.com/pubs/69660/tr-98-21.ps

Genetic testing - “A genetic approach for random testing of database systems”
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.97.3435&rep=rep1&type=pdf

Random query generator
− https://launchpad.net/randgen
− http://forge.mysql.com/wiki/RandomQueryGenerator
− http://datacharmer.blogspot.com/2008/12/guest-post-philip-stoev-if-you-love-it.html
− http://carotid.blogspot.com/2008_09_01_archive.html#521833683342482424
References, cont'd
 Drizzle + Drizzle testing tools
− http://drizzle.org/

https://launchpad.net/drizzle

http://docs.drizzle.org/testing/test-run.html

http://docs.drizzle.org/testing/dbqp.html

http://docs.drizzle.org/testing/randgen.html

kewpie_demo_tree:

lp:~patrick-crews/drizzle/dbqp_kewpie_demo
