Hive is a powerful data warehousing application built on top of Hadoop that lets you use SQL to access your data. This lecture gives an overview of Hive and its query language, and includes a work-along exercise. The next video is a screencast of a user performing that exercise.
Check http://www.cloudera.com/hadoop-training-basic for training videos.
Background
• Started at Facebook
• Data was collected by nightly cron jobs into an Oracle database
• "ETL" via hand-coded Python
• Grew from tens of GBs (2006) to 1 TB/day of new data (2007); now roughly 10x that
Hadoop as Enterprise Data Warehouse
• Scribe and MySQL data loaded into Hadoop HDFS
• Hadoop MapReduce jobs to process data
• Missing components:
  – Command-line interface for "end users"
  – Ad-hoc query support
    • … without writing full MapReduce jobs
  – Schema information
Metastore
• Database: a namespace containing a set of tables
• Holds table definitions (column types, physical layout)
• Holds partition metadata
• Implemented with the JPOX ORM layer; can be backed by Derby, MySQL, and many other relational databases
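As one illustration of swapping the metastore backend, pointing it at MySQL instead of the default embedded Derby is a matter of setting the JDO connection properties in hive-site.xml. A minimal sketch; the host name, database name, and credentials below are placeholders:

```xml
<!-- hive-site.xml fragment: back the metastore with MySQL instead of
     embedded Derby. Host, database name, and credentials are placeholders. -->
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://metastore-host/hive_metastore</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>hive</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>secret</value>
</property>
```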
Physical Layout
• Warehouse directory in HDFS
  – e.g., /home/hive/warehouse
• Tables stored in subdirectories of the warehouse
  – Partitions and buckets form subdirectories of tables
• Actual data stored in flat files
  – Control-character-delimited text, or SequenceFiles
  – With a custom SerDe, can use arbitrary formats
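The directory layout above follows directly from the table definition. As a sketch (the table, columns, and partition key are invented for illustration), a partitioned text-format table could be declared like this:

```sql
-- Hypothetical table for illustration: a text-format table using the
-- default Ctrl-A (\001) field delimiter, partitioned by date string.
CREATE TABLE page_views (
  user_id BIGINT,
  url     STRING
)
PARTITIONED BY (dt STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\001'
STORED AS TEXTFILE;
```

Its data would then live under the warehouse directory (e.g., /home/hive/warehouse/page_views/), with each partition in its own subdirectory such as dt=2009-03-01/.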
Running the join

hive> INSERT OVERWRITE TABLE merged
    > SELECT s.word, s.freq, k.freq
    > FROM shakespeare s JOIN kjv k ON (s.word = k.word)
    > WHERE s.freq >= 1 AND k.freq >= 1;
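INSERT OVERWRITE requires the output table to exist before the join runs. A minimal sketch of that setup; the column names are assumptions chosen to match the three-column SELECT list:

```sql
-- Assumed schema for the join's output table: one row per shared word,
-- with its frequency in each corpus. Column names are illustrative.
CREATE TABLE merged (word STRING, shake_f INT, kjv_f INT);
```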
Some more advanced features…
• TRANSFORM: embed custom MapReduce scripts in SQL statements
• Custom SerDe: read and write arbitrary file formats
• Metastore check tool
• Structured query log
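As a sketch of the TRANSFORM feature: the script name is hypothetical, and the assumed contract is simply that the executable reads tab-separated rows on stdin and writes tab-separated rows on stdout.

```sql
-- Stream each (word, freq) row through a user-supplied script.
-- 'my_filter.py' is a hypothetical executable distributed to the
-- cluster with ADD FILE; it reads tab-separated rows on stdin and
-- emits tab-separated rows on stdout.
ADD FILE my_filter.py;

SELECT TRANSFORM (word, freq)
       USING 'python my_filter.py'
       AS (word STRING, freq INT)
FROM shakespeare;
```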
Project Status
• Open source, Apache 2.0 license
• Official subproject of Apache Hadoop
• 4 committers (all from Facebook)
• First release candidate coming soon