
MySQL Data Warehousing

Survival Guide
Marius Moscovici (marius@metricinsights.com)
Steffan Mejia (steffan@lindenlab.com)
Topics

• The size of the beast
• Evolution of a Warehouse
• Lessons Learned
• Survival Tips
• Q&A
Size of the beast

• 43 Servers
o 36 active
o 7 standby spares
• 16 TB of data in MySQL
• 12 TB archived (pre-S3 staging)
• 4 TB archived (S3) 
• 3.5B rows in main warehouse 
• Largest table ~ 500M rows (MySQL)
Warehouse Evolution - First came slaving

Problems:

• Reporting slaves easily fall behind
• Reporting limited to one-pass SQL
Warehouse Evolution - Then came temp tables

Problems:

• Easy to lock replication with temp table creation
• Slaving becomes fragile
Warehouse Evolution - A Warehouse is Born
Problems:

• Warehouse workload limited by what can be performed by a single server
Warehouse Evolution - Workload Distributed
Problems:

• No real-time application integration support
Warehouse Evolution - Integrate Real Time Data
Lessons Learned - Warehouse Design

Workload exceeds available memory
Lessons Learned - Warehouse Design
• Keep joins < available memory

• Heavily denormalize data for effective reporting

• Minimize joins between large tables

• Aggressively archive historical data 


Lessons Learned - Data Movement
• mysqldump is your friend

• Sequence parent/child data loads based on ETL assumptions
o Orders without order lines
o Order lines without orders

• Data Movement Use Cases
o Full
o Incremental
o Upsert (INSERT ... ON DUPLICATE KEY UPDATE)
Full Table Loads
• Good for small tables

• Works for tables with no primary key 

• Data is fully replaced on each load
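One way to make "fully replaced on each load" safe for concurrent readers is a shadow-table swap — a sketch, with illustrative table and database names that are not from this deck:

```sql
-- Build the new copy alongside the live table.
CREATE TABLE lookup_codes_new LIKE lookup_codes;

-- Reload the full extract (e.g. piped in from mysqldump on the source).
INSERT INTO lookup_codes_new
SELECT * FROM source_db.lookup_codes;

-- Swap both names in one atomic statement, then drop the old copy.
RENAME TABLE lookup_codes     TO lookup_codes_old,
             lookup_codes_new TO lookup_codes;
DROP TABLE lookup_codes_old;
```

Readers never see a half-loaded table: RENAME TABLE exchanges both names atomically.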


Incremental Loads
• Table contains new rows but no updates

• Good for insert-only tables

• High-water mark included in the mysqldump --where clause
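A sketch of the incremental extract, assuming a hypothetical insert-only event_log table with a monotonically increasing key (the same predicate can be passed to mysqldump via --where):

```sql
-- Find the high-water mark already loaded into the warehouse.
SELECT IFNULL(MAX(event_id), 0) INTO @high_water
FROM warehouse_db.event_log;

-- Extract only the rows above it; insert-only means no updates to re-fetch.
SELECT *
FROM source_db.event_log
WHERE event_id > @high_water;
```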
Upsert Loads
• Table contains new and updated rows

• Table must have primary key

• Can be used to update only subset of columns
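The upsert case maps directly onto MySQL's INSERT ... ON DUPLICATE KEY UPDATE. A sketch with illustrative table and column names, showing how only a subset of columns is refreshed on a key match:

```sql
-- Rows are matched on the primary key (user_id); on conflict,
-- only status and last_seen are updated - other columns keep old values.
INSERT INTO user_dim (user_id, status, last_seen)
SELECT user_id, status, last_seen
FROM user_dim_staging
ON DUPLICATE KEY UPDATE
    status    = VALUES(status),
    last_seen = VALUES(last_seen);
```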


Lessons Learned - ETL Design
• Avoid large joins like the plague

• Break out ETL jobs into bite-sized pieces

• Ensure target data integrity on ETL failure

• Use memory staging tables to boost performance 


ETL Design - Sample Problem
Build a daily summary of customer event log activity
ETL Design - Sample Solution
ETL Pseudo code - Step 1
1) Create staging table & find high-water mark:

SELECT IFNULL(MAX(calendar_date),'2000-01-01') INTO @last_loaded_date
FROM user_event_log_summary;

SET max_heap_table_size = <big enough number to hold several days' data>;

CREATE TEMPORARY TABLE user_event_log_summary_staging
(.....)
ENGINE = MEMORY;

CREATE INDEX user_idx
USING HASH ON user_event_log_summary_staging(user_id);
ETL Pseudo code - Step 2
2) Summarize events:
INSERT INTO user_event_log_summary_staging (
calendar_date, 
user_id, 
event_type, 
event_count)

SELECT 
DATE(event_time), 
user_id, 
event_type, 
COUNT(*)
FROM event_log
WHERE event_time > CONCAT(@last_loaded_date, ' 23:59:59')
GROUP BY 1,2,3;
ETL Pseudo code - Step 3
3) Set denormalized user columns:
UPDATE user_event_log_summary_staging log_summary,
       user
SET log_summary.type   = user.type,
    log_summary.status = user.status
WHERE user.user_id = log_summary.user_id;
ETL Pseudo code - Step 4
4) Insert into target table:
INSERT INTO user_event_log_summary
(...)
SELECT ...
FROM user_event_log_summary_staging;
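To honor the earlier lesson about target data integrity on ETL failure, this final insert can be wrapped in a transaction — a sketch, assuming the target table is InnoDB (the MEMORY staging table is non-transactional, but it is rebuilt from scratch on every run anyway):

```sql
START TRANSACTION;

INSERT INTO user_event_log_summary
(...)
SELECT ...
FROM user_event_log_summary_staging;

COMMIT;
-- On any error, ROLLBACK leaves the target table exactly as it was,
-- and the whole step can be safely re-run.
```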
Functional Partitioning

• Benefits depend on
o Partition execution times
o Data move times
o Dependencies between functional partitions
Functional Partitioning
Job Management

• Run everything single-threaded on a server

• Handle dependencies between jobs across servers

• Smart re-start key to survival

• Implemented 3-level hierarchy of processing


o Process (collection of build steps and data moves)
o Build Steps (ETL 'units of work')
o Data Moves
DW Replication

• Similar to other MySQL environments


o Commodity hardware 
o Master-slave pairs for all databases

• Mixed environments can be difficult


o Use rsync to create slaves (without ssh, on a private network)
 
• Monitoring 
o Reporting queries need to be monitored
 Beware of blocking queries
 Only run reporting queries on slave (temp table issues)
o Nagios
o Ganglia
o Custom scripts
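A custom monitoring script might run a check along these lines — a sketch of the idea, with an illustrative 10-minute threshold, not the deck's actual scripts:

```sql
-- Find reporting queries that have been running long enough
-- to threaten replication or block other sessions.
SELECT id, user, host, time, LEFT(info, 100) AS query_head
FROM information_schema.PROCESSLIST
WHERE command = 'Query'
  AND time > 600;

-- A long-running offender can then be killed before it stalls the slave:
-- KILL <id>;
```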
Infrastructure Planning

• Replication latency
o Warehouse slave unable to keep up
o Disk utilization > 95%
o Required frequent re-sync
 
 
• Options evaluated
o  Higher speed conventional disks
o  RAM increase
o  Solid-state-disks
Optimization

• Check / reset HW RAID settings


• Use general query log to track ETL / Queries
• Application timing 
o isolate poor-performing parts of the build
• Optimize data storage - automatic roll-off of older data
Infrastructure Changes

• Increased memory 32GB -> 64GB


• New servers have 96GB RAM

• SSD Solution
o 12 & 16 disk configurations
o RAID6 vs. RAID10
o 2.0 TB or 1.6 TB formatted capacity
o SATA2 HW BBU RAID6
o ~ 8 TB data on SSD
Results

• Sometimes it pays to throw hardware at a problem


o 15-hour warehouse builds on old system
o 6 hours on optimized system
o No application changes
Finally...Archive

Two-tiered solution
• Move data into archive tables in separate DB
• Use SELECT to dump data - efficient and fast
• Archive server handles migration
o Dump data
o GPG
o Push to S3
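The "use SELECT to dump data" step can be sketched with SELECT ... INTO OUTFILE, which writes rows straight to a flat file the archive server can then GPG-encrypt and push to S3. Path and table name here are illustrative:

```sql
-- Dump one archive table's rows to a CSV file on the server's disk.
SELECT *
INTO OUTFILE '/archive/event_log_2010q1.csv'
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\n'
FROM archive_db.event_log_2010q1;
```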
Survival Tips

• Efforts to scale are non-linear


o As you scale, it becomes increasingly difficult to manage
o Be prepared to supplement your warehouse strategy
 Dedicated appliance
 Distributed processing (Hadoop, etc)
• You can gain a great deal of headroom by optimizing I/O
o Optimize current disk I/O path
o Examine SSD / Flash solutions
o Be pragmatic about table designs
• It's important to stay ahead of the performance curve
o Be proactive - monitor growth, scale early
• Monitor everything, including your users
o Bad queries can bring replication down