
Introducing kewpie – a feedback-

based query generator


patrick crews
gleebix@gmail.com
wc220.com
IRC: pcrews

These slides released under the Creative Commons Attribution-Noncommercial-Share Alike 3.0 License
What we'll cover


Not just about the new tool
− Background on the problems related to the
functional testing of database systems

We'll survey and assess the state of the testing tool
landscape

Applications / strengths / limitations
Why the history lesson?


Testing is a difficult task, particularly for
complex systems like database servers

This task presents its own interesting set of
obstacles to overcome.

A little bit of background is necessary to
understand the reasoning behind this
approach.
 “Not ignorance, but ignorance of ignorance is
the death of knowledge”
− Alfred North Whitehead
Where do you get these ideas?


Based on research provided by Microsoft's SQL
Server team

One of the only sources for material on this topic
− Their work has provided invaluable insight into
long-term results of strategies and inspiration for
newer testing techniques
 Hands-on experience working with MySQL and
Drizzle

Almost 4 years of blood, sweat, and tears ; )
Testing Databases is Crazy Hard!
It's a big task


Table composition (data types and combinations
thereof)

Table population (size and data distribution)

Query space – SQL is expressive!

Table access methods (optimizations, etc)
− index_merge/intersection = the unicorn of optimizer
testing ; )

Effects of various switches (materialization,
semijoin, etc)
It's a really big task


Essentially infinite input space + ever-growing
feature set =

Exhaustive testing – 'not gonna happen'
− Need to be smart about what we do test
− Need to be ruthless about what tests we accept as
good
 Maintenance is costly – can't waste time on useless tests
Some additional things


Not easy to unit-test

Logical separations well-understood, testing them not
so easy.
 Semantics of the test are as suspect as the code
− Hack up a parse tree for a subquery-heavy bit of SQL?

Time to benefit ratio = not so good / lots of effort
− Our unit-testing GSOC student ran away after the summer and hasn't
come near Drizzle since ; )
 End-to-end testing (SQL queries) most
effective / productive
Focus on functional testing


Other types of tests are important, but having
a really fast server that delivers incorrect
results doesn't matter

One could also argue that such tests evolve
from a set of solid queries that exercise the
server code
 We concern ourselves with useful query / test
case generation
Useful queries and tests?


A test with 1 million SELECT 1's technically
does something, but nothing any user would
likely ever care about.

Additionally, the mysql test suite is filled with
random tests where devs thought they were
doing something, but there is no definitive
proof.
− one test with 10k rows of data and 2 simple
selects – why?
 It 'seemed' like it was doing something?
− Devolving into superstition at this point
The evolution of testing tools
Understanding history


Let's look at the various functional tools
available

Understanding their strengths and limitations helps
us to understand how subsequent tools came to be

Hand-crafted tests
 Random / stochastic testing tools


feedback-based random query generation
− aka genetic algorithms
 mad science...mwa ha ha! ; )
hand-crafted tests
e.g.
drizzle-test-run / mysql-test-run
In the beginning...


How almost all testing starts.

Quick

Easy
 VERY good for targeted testing

Easily verified results / limited domain

LENGTH(), SUBSTR()

small, limited functions
 Can apply equivalence class partitioning

− out of lower bounds, good input, out of upper bounds


Still singing the praises


It is how most systems test
− Testing based on this strategy helped make MySQL
into a solid and widely-used product

Postgres' test suite is based on similar tests
− Drizzle still uses it for a significant portion of its
own testing
 Significant time and effort have been put into most
tests

Waste not, want not
Hand-crafted tests


Can be anything
− we generally view DTR .test files as a case
− mysql_protocol.prototest uses python scripting
 Generally mean a highly targeted test case
that was written by a human and that will
likely require maintenance and extension by
one as well
Example test (slave plugin)

--disable_warnings
DROP TABLE IF EXISTS t1;
--enable_warnings
--echo Populating master server
CREATE TABLE t1 (a int not null auto_increment, primary key(a));
INSERT INTO t1 VALUES (),(),();
SELECT * FROM t1;
….
--echo Connecting to slave...
connect (slave_con,127.0.0.1,root,,test, $BOT0_S1);
echo Using connection slave_con...;
connection slave_con;
--sleep 3
--echo Checking slave contents...
--source include/wait_for_slave_plugin_to_sync.inc
SHOW CREATE TABLE t1;
SELECT * FROM t1;
Example result
DROP TABLE IF EXISTS t1;
Populating master server
CREATE TABLE t1 (a int not null auto_increment, primary key(a));
INSERT INTO t1 VALUES (),(),();
SELECT * FROM t1;
a
1
2
3
Connecting to slave...
Using connection slave_con...
Checking slave contents...
SHOW CREATE TABLE t1;
Table Create Table
t1 CREATE TABLE `t1` (
`a` INT NOT NULL AUTO_INCREMENT,
PRIMARY KEY (`a`)
) ENGINE=InnoDB COLLATE = utf8_general_ci
SELECT * FROM t1;
a
1
2
3
Switching to default connection
DROP TABLE t1;
So, what's the issue?


Where to begin?
− Scalability / sustainability
 Human effort / attention required to just maintain
existing tests can be considerable

Effort to write new tests, particularly for complicated
features

Microsoft's research has shown an average of 0.5 hours
for a good, hand-crafted test
− Testing complicated features == EXPENSIVE!
 Also shown that test development time can far exceed
code development time
Bad strategery, cont'd


Lack of coverage
− Due to the nature of human development cycles,
typically simpler cases are created first. As a
result, more complex bugs can be missed until
much later in the development cycle (if at all)
− Microsoft has admitted that bugs were found long
after the defective feature had been rolled out
due to the non-standard circumstances required to
trigger them
Lack of coverage, cont'd


More complex queries, such as the heavy use
of subqueries for optimizer tests, literally
can't be written by hand

The effort required to create valid complex
queries isn't worth it

Validation is also quite problematic

Will discuss solutions to this problem in a bit...
− If anyone is actually good at these tasks, I am
scared of them

Break out the Turing test!
Crazy-complex queries!

SELECT STRAIGHT_JOIN table1 . `col_text_key` AS field1 , table1 . `col_char_1024_key` AS field2 , table1 . `col_int_key` AS field3 , table1 . `col_char_10_key` AS field4 , table1 . `col_text_not_null` AS field5
FROM ( ( SELECT SUBQUERY1_t2 . * FROM AA AS SUBQUERY1_t1 WHERE SUBQUERY1_t1 . `col_char_1024_key` <= SUBQUERY1_t1 . `col_text_key` ) AS table1 STRAIGHT_JOIN AA AS table2 ON (table2 . `col_int_not_null_key` = table1 . `col_bigint` ) )
WHERE ( table2 . `col_char_1024_not_null` <= ( SELECT MAX( SUBQUERY2_t1 . `col_char_1024` ) AS SUBQUERY2_field1 FROM BB AS SUBQUERY2_t1 ) ) OR table1 . `col_text_key` IS NULL
GROUP BY field1, field2, field3, field4, field5 ORDER BY field1 ASC

EXPLAIN SELECT table1 . `col_char_10_key` AS field1
FROM ( ( SELECT SUBQUERY1_t1 . * FROM CC AS SUBQUERY1_t1 ) AS table1 INNER JOIN ( ( ( SELECT DISTINCT SUBQUERY2_t2 . * FROM AA AS SUBQUERY2_t1 ) AS table2 RIGHT OUTER JOIN A AS table3 ON (table3 . `col_int_not_null_key` = table2 . `col_int_not_null_key` ) ) ) ON (table3 . `col_text_not_null_key` = table2 . `col_char_1024_key` ) )
WHERE ( NOT EXISTS ( ( SELECT SUBQUERY3_t2 . `col_int` AS SUBQUERY3_field1 FROM ( AA AS SUBQUERY3_t1 STRAIGHT_JOIN AA AS SUBQUERY3_t2 ON (SUBQUERY3_t2 . `col_int_key` = SUBQUERY3_t1 . `col_int_not_null_key` ) ) ) ) ) AND table1 . `pk` > 117 AND table1 . `pk` < ( 117 + 135 )
GROUP BY field1 ORDER BY field1 , field1 LIMIT 10 OFFSET 2
stochastic-based testing
e.g.
random query generator
Technology to the rescue!


1998 (yes, 1998!) - Microsoft publishes a paper
outlining the RAGS system

Brute force, intelligently applied
− Automated tool for query generation
 Allows rapid generation of complex test queries
− Microsoft research refers to a 1 million-fold increase in
query volume

Recognizes that query generation and execution are
essentially mechanical tasks

Leave the human free to use creativity and attention
where it is effective
RAGS


randomly generated queries
− General rules for query construction
 via stochastic parse tree

Throw it at the server

Validation through comparison
− Different DBMS's
− Same software w/ different settings
 Crash detection

~50% of generated queries executed / returned results
MySQL vs. the randgen


2007-ish, the random query generator (aka
randgen) is unleashed on the MySQL codebase

Based on Microsoft's RAGS research
− Put the hurting on the Falcon storage engine
− Also part of why we are just now seeing 6.0
optimizer features being reintroduced into MySQL
>: )

Admittedly a lot of edge cases
− Broken is broken

Lots of edge case bugs are worrisome as well
 What the hell is going on in the code?
Sample randgen grammar

query:
SELECT * FROM _table WHERE int_field comparison_operator _digit ;

comparison_operator:
> | >= | < | <= | > | < | > | < | = | != ;

int_field:
col_int | col_int_key | col_int_not_null | col_int_not_null_key | pk ;
Applications


Good for providing an initial baseline
− Determine if the code is solid enough for a human
to devote craftiness to breaking it

Frees QA devs' time for better things
 think up more challenges

work on fine-grained testing (hand-crafted tests)

It is vital to remember that QA is a creative (not
mechanical) task. Outsmarting buggy code requires
time
 “See what shakes out”
Applications, cont'd


Good for testing some things, not others
− Covers a lot of ground, but not so easy to express
certain things in a stochastic manner

Optimizer validation = great!
− Transaction log = great!
− Testing complex scenarios = not so great

Drizzledump migration
Now for the bad stuff


There are tradeoffs – can do some things at
the expense of not doing other tasks

We trade precision for brute force
− The wrong tool can make a seemingly easy task
much, much more difficult ; )
 Like picking up a dime with a pair of gloves

Again, testing Drizzledump migration
− pcrews.egg_on_face=True
Tests can still be expensive


We can cover a lot of ground for our efforts,
but it still takes development and
maintenance time

Creating verifiable tests
− Tuning the tests
 To hit desired code
 To generate valid queries

 Often cyclical


Maintenance
 How hard to update or change?
Expensive tests


Optimizer grammars took ~4 months to
produce

It was a first effort, but it shows that becoming
familiar enough with a feature to test it is not trivial

Outer join grammars took ~2 months

Need to figure in feature complexity, value,
etc
Additional costs


Development and tuning
− How easily expanded are the tools as we discover
new ideas / needs?

Certain things hard to express
 can't always change the server state as we want

Changing the tests can also be expensive

Tradeoff between tuning and robustness
How complex could a test be?

join:
{ $stack->push() }
table_or_join
{ $stack->set("left",$stack->get("result")); }
left_right outer JOIN table_or_join
ON
join_condition ;

join_condition:
int_condition | char_condition ;

int_condition:
{ my $left = $stack->get("left"); my %s=map{$_=>1} @$left; my @r=(keys %s); my $table_string = $prng->arrayElement(\@r); my @table_array = split(/AS/, $table_string); $table_array[1] } . int_indexed =
{ my $right = $stack->get("result"); my %s=map{$_=>1} @$right; my @r=(keys %s); my $table_string = $prng->arrayElement(\@r); my @table_array = split(/AS/, $table_string); $table_array[1] } . int_indexed
{ my $left = $stack->get("left"); my $right = $stack->get("result"); my @n = (); push(@n,@$right); push(@n,@$left); $stack->pop(\@n); return undef } |

{ my $left = $stack->get("left"); my %s=map{$_=>1} @$left; my @r=(keys %s); my $table_string = $prng->arrayElement(\@r); my @table_array = split(/AS/, $table_string); $table_array[1] } . int_indexed =
{ my $right = $stack->get("result"); my %s=map{$_=>1} @$right; my @r=(keys %s); my $table_string = $prng->arrayElement(\@r); my @table_array = split(/AS/, $table_string); $table_array[1] } . int_field_name
{ my $left = $stack->get("left"); my $right = $stack->get("result"); my @n = (); push(@n,@$right); push(@n,@$left); $stack->pop(\@n); return undef }
Test complexity, cont'd

int_field_name:
`pk` | `col_int_key` | `col_int` |
`col_bigint` | `col_bigint_key` |
`col_int_not_null` | `col_int_not_null_key` ;

char_field_name:
`col_char_10` | `col_char_10_key` | `col_text_not_null` | `col_text_not_null_key` |
`col_text_key` | `col_text` | `col_char_10_not_null_key` | `col_char_10_not_null` |
`col_char_1024` | `col_char_1024_key` | `col_char_1024_not_null` |
`col_char_1024_not_null_key` ;

int_indexed:
`pk` | `col_int_key` | `col_bigint_key` | `col_int_not_null_key` ;

char_indexed:
`col_char_1024_key` | `col_char_1024_not_null_key` |
`col_char_10_key` | `col_char_10_not_null_key` ;
Tuning vs. Robustness
# 2011-03-22T20:45:24 Rows returned:
$VAR1 = {
' 0' => 148,
' 1' => 8,
' 2' => 1,
' 3' => 1,
' 4' => 1,
' -1' => 76,
' 10' => 2,
'>10' => 2,
'>100' => 1
};
# 2011-03-22T20:45:24 Rows affected:
$VAR1 = undef;
# 2011-03-22T20:45:24 Explain items:
$VAR1 = undef;
# 2011-03-22T20:45:24 Errors:
$VAR1 = {
'(no error)' => 173,
'Unknown column \'%s\' in \'IN/ALL/ANY subquery\'' => 12,
'Unknown column \'%s\' in \'field list\'' => 37,
'Unknown column \'%s\' in \'having clause\'' => 9,
'Unknown column \'%s\' in \'where clause\'' => 1,
'Unknown table \'%s\'' => 18
};
Valid queries


Was considered a large problem at Microsoft

Have run into similar issues with the randgen

Difficult to express more complex queries /
sets of queries while keeping them valid and
worthwhile
− Makes it harder to hit difficult / rare code paths or
combinations of them
Wasteful / not reusable


Every time we run the randgen, we generate
the same invalid queries

Good thing – every run with a given seed = same
data and queries produced

Repeatability is a mantra of QA!
− Bad thing – we waste cycles on queries that don't
make it deep into the system

No way to organize queries so we can filter
them according to criteria

At least not yet, randgen devs are a crafty lot!
feedback-based query generation
e.g.
kewpie
Microsoft leads the way again


To overcome the limitations of purely
stochastic systems, adopted a genetic-based
approach

Generate / execute / evaluate / mutate

Uses a variety of feedback from the system
under test to determine 'fitness' of a query
− Keep it?

Mutate it further?
Genetic-based testing


Progressive building of valid queries
− SELECT col1 FROM table1;
− SELECT col1 FROM table1 WHERE col2 < 'value'
− SELECT col1 FROM table1 WHERE col2 < 'value'
AND...
 We end up with a set of queries that have
some marked effect on the database
Organizing queries


MS uses a data warehouse of these test queries
− Provides a pool for all new testing efforts
 Have a new measure of 'interesting'?
− Pull some queries and put them through the system!

Easily organized / sorted / manipulated
 Provides a set of well-cataloged building blocks for
future tests
kewpie...finally ; )


Drizzle's efforts at creating this technology

Our testing experiences have been in-line with
Microsoft's and we recognize similar needs

kewpie? = query probulator
− Futurama for the win!
The probulator!
kewpie


Still very early in development
− Sorry, it won't make your database webscale
overnight

Hire a marketing department for that ; )
 Idea is to teach something how to create
queries once and then provide a means of
directing the query generation

Use of feedback

evaluation functions
− Use of specific mutation patterns
 Favoring some / probability tweaks
Evaluation functions


Check the effects of our query(ies) on the
database

code coverage
− gdb output (Igor delta debugger project)
− EXPLAIN plans

changes in select variables

log output
− custom code instrumentation

The possibilities are endless!
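
As a concrete (if hypothetical) illustration, an EXPLAIN-based evaluation function could be as small as the sketch below – the function name and the pass/fail rule are assumptions for illustration, not kewpie's actual evaluator API:

# Hypothetical sketch of an EXPLAIN-based evaluation function, written against
# a plain DB-API cursor; not kewpie's real interface.
def explain_evaluator(cursor, query, target='index_merge'):
    """Treat a query as 'interesting' if its plan mentions the access method we want."""
    cursor.execute("EXPLAIN " + query)
    plan_rows = cursor.fetchall()
    # keep the query if any row of the plan mentions the targeted access method
    return any(target in str(row) for row in plan_rows)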
kewpie, cont'd


Written in Python

Currently tightly integrated with dbqp.py,
Drizzle's experimental test runner

Likely to become a more separate tool over time
− Expedience and all of that

Originally based on SQLAlchemy

They try to help you succeed a bit too much for
nefarious testing purposes ; )
Design ideas


First and foremost a query generator
− Create good, effective, well-cataloged queries
− Building blocks for more complex tests
 Stress tests

Performance
 Durability


etc
Design ideas, cont'd


We want to have a 'query' be:
− easily manipulated
− easily analyzed / broken down
 Provide a robust set of functions for working
with query objects

addColumn(type=None, aggregate=False...)

addTable(name=None,rowCount=None...)

Lots of knobs for tuning things
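
A rough, illustrative sketch of that kind of interface follows. The addColumn()/addTable() signatures come from the slide above; the toy SCHEMA, toSQL() and the method bodies are assumptions, not kewpie's actual implementation:

# Illustrative query object sketch (assumed names/bodies, not kewpie's code)
import random

SCHEMA = {'t1': ['pk', 'col_int', 'col_char_10'],
          't2': ['pk', 'col_int_key', 'col_text']}

class Query(object):
    def __init__(self):
        self.tables = []    # tables in the FROM clause
        self.columns = []   # items in the SELECT list
        self.wheres = []    # WHERE conditions

    def addTable(self, name=None, rowCount=None):
        # pick a table from the test bed if the caller doesn't name one
        # (rowCount is accepted only to mirror the slide; unused in this sketch)
        self.tables.append(name or random.choice(list(SCHEMA)))

    def addColumn(self, type=None, aggregate=False):
        # draw only from tables already in the query, so the SQL stays valid
        table = random.choice(self.tables)
        column = "%s.%s" % (table, random.choice(SCHEMA[table]))
        self.columns.append("MAX( %s )" % column if aggregate else column)

    def toSQL(self):
        sql = "SELECT %s FROM %s" % (", ".join(self.columns), ", ".join(self.tables))
        if self.wheres:
            sql += " WHERE " + " AND ".join(self.wheres)
        return sql

A test could then call q.addTable() and q.addColumn(aggregate=True) a configured number of times and hand q.toSQL() to the server.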
What can it do?


Not a lot quite yet...: /
− Generation of SELECT lists, certain JOINs and
WHERE conditions

Still a lot to do...was a bit distracted:
 We had this GA release thing we were working on...
Structure


As with all dbqp 'modes', we have a custom
test executor and test manager

− testManager
 what does a testCase look like / how to package the
relevant data for execution
 manages testCases for the testExecutor
− testExecutor
 Set up for the test
 Execute it
 Evaluate the results
Structure


query_manager:
− populates and manipulates queries
 add_tables()
− add a given number of tables to the query object from what is
available in the test bed

add_columns()
− add a column from the tables used in the query
 add_where()
− add a where clause using an available column from the tables
used in the query
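
A loose sketch of how add_where() might behave, assuming the query is held as a simple dict and the test bed as a table-to-columns map (illustrative only; add_tables() and add_columns() would follow the same pattern, and the real query_manager in dbqp/kewpie differs in detail):

# Assumed names and data layout, for illustration only
import random

TEST_BED = {'t1': ['pk', 'col_int', 'col_char_10'],
            't2': ['pk', 'col_int_key', 'col_text']}

def add_where(query):
    # candidate columns come only from tables already used in the query,
    # which is what keeps the generated condition valid
    table = random.choice(query['tables'])
    column = random.choice(TEST_BED[table])
    query['wheres'].append("%s.%s > %d" % (table, column, random.randint(0, 100)))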
Structure


query generation = all about the tables
− we center everything on this as it determines what
columns are available for valid queries

generating invalid SQL will be necessary and useful
at some point, but it is entirely too easy to pick a
column not used in the query

The randgen requires you to pick a bit blindly
in terms of column / table combinations
− means invalid queries (boo, hiss!)
Structure


query_evaluator
− Runs the various bits of evaluator code
− Currently very primitive
 Only have row_count evaluator

Can add other evaluations as needed
 Will eventually need proper fitness functions to
determine which queries will 'live'
− Not all evaluators are as simple as pass/fail.
− optimizer table access methods might require building blocks
that don't hit the targeted optimization initially
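
A sketch of the row_count evaluator described above, again written against a plain DB-API cursor; the exact pass/fail rule (no error, at least one row) is an assumption for illustration:

# Assumed evaluator shape, not the actual kewpie code
def row_count_evaluator(cursor, query, min_rows=1):
    try:
        cursor.execute(query)
        rows = cursor.fetchall()
    except Exception:
        return False                # the query errored out: not a keeper
    return len(rows) >= min_rows    # keep queries that actually return data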
test structure


We prime the system via a python cnf file
− Determines initial query population and their
structure

All generated on-the-fly for now
 Eventually will be able to pull from a database of test
queries, files, etc
− Determines how we mutate a query
 what are we allowed to do to change things
− Designed to help guide generation in a desired fashion
 Large, valid JOIN operations for example
configuration file
[test_info]
comment = basic test of kewpie seeding
test_schema = test
init_query_count = 5
init_table_count = 2
init_column_count = 4
init_where_count = 0

# limits for various values


max_table_count = 10
min_column_count = 1
max_column_count = 25
max_where_count = 10
max_mutate_count = 10

[mutators]
add_table = 2
add_column = 4
add_where = 3

[test_servers]
servers = [[--innodb.replication-log]]

[evaluators]
row_count = True
explain_output = False
test execution


We create an initial set of queries

We then execute each query

If it passes evaluation, we then create a copy
− The original good query can serve as a seed for
further mutations
− The copy is mutated and executed
 We use max_mutate_count to limit query
lifespan (no endless runs)
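
In rough Python, the loop described above might look like the sketch below; the helper functions are stand-ins (the real tool talks to a live server through dbqp and applies the configured mutators), so only the control flow is meant to match the slides:

# Minimal sketch of the execute / evaluate / mutate cycle (assumed helpers)
import random

MAX_MUTATE_COUNT = 10                                  # max_mutate_count from the cnf
MUTATORS = ['add_table', 'add_column', 'add_where']    # mirrors the [mutators] section

def execute(query):
    """Stand-in for running the query and collecting feedback from the server."""
    return {'row_count': random.randint(-1, 5)}        # -1 == error

def evaluate(result):
    """Stand-in row_count evaluation: keep queries that ran and returned rows."""
    return result['row_count'] > 0

def mutate(query):
    """Return a mutated copy of the query; the original survives as a seed."""
    return query + " /* mutated via %s */" % random.choice(MUTATORS)

def run(seed_queries):
    kept = []
    for seed in seed_queries:
        query = seed
        for _ in range(MAX_MUTATE_COUNT):              # limit query lifespan
            result = execute(query)
            if not evaluate(result):
                break                                  # failed evaluation: this line dies out
            kept.append(query)                         # good query kept as a building block
            query = mutate(query)                      # a copy is mutated and tried next
    return kept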
Next steps


How long do you have to listen? ; )
− database storage and retrieval of queries
− more fine grained control over query generation
− more extensive / complex query generation via
query mixing
 subqueries

unions
− trimming the query pool
Next steps...still


More evaluation code
− gcov, gdb, EXPLAIN...
 Fitness functions
 The list goes on
− As mentioned earlier, the test domain is
essentially infinite
− code will evolve to solve problems
Demo Time!
Summary
Testing Challenge
drizzle-test-run/mysql-test-run
random query generator
kewpie
References
 Microsoft

RAGS –“Massive Stochastic Testing of SQL”
http://research.microsoft.com/pubs/69660/tr-98-21.ps

Genetic testing - “A genetic approach for random testing of database systems”
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.97.3435&rep=rep1&type=pdf

Random query generator
− https://launchpad.net/randgen
− http://forge.mysql.com/wiki/RandomQueryGenerator
− http://datacharmer.blogspot.com/2008/12/guest-post-philip-stoev-if-you-love-it.html
− http://carotid.blogspot.com/2008_09_01_archive.html#521833683342482424
References, cont'd
 Drizzle + Drizzle testing tools
− http://drizzle.org/

https://launchpad.net/drizzle

http://docs.drizzle.org/testing/test-run.html

http://docs.drizzle.org/testing/dbqp.html

http://docs.drizzle.org/testing/randgen.html

kewpie_demo_tree:

lp:~patrick-crews/drizzle/dbqp_kewpie_demo
