Yasin N. Silva, ysilva@asu.edu
Isadora Almeida, isilvaal@asu.edu
Michell Queiroz, mfelippe@asu.edu
ABSTRACT
The Structured Query Language (SQL) is the main programming
language designed to manage data stored in database systems.
While SQL was initially used only with relational database
management systems (RDBMS), its use has been significantly
extended with the advent of new types of database systems.
Specifically, SQL has been found to be a powerful query language
in highly distributed and scalable systems that process Big Data,
i.e., datasets with high volume, velocity, and variety. While
traditional relational databases now represent only a small fraction
of the database systems landscape, most database courses that cover
SQL consider only the use of SQL in the context of traditional
relational systems. In this paper, we propose teaching SQL as a
general language that can be used in a broad range of database
systems from traditional RDBMSs to Big Data systems. This paper
presents well-structured guidelines to introduce SQL in the context
of new types of database systems including MapReduce, NoSQL
and NewSQL. A key contribution of this paper is the description of
an array of course resources, e.g., virtual machines, sample
projects, and in-class exercises, to enable a hands-on experience
with SQL across a broad set of modern database systems.
General Terms
Design, Experimentation
Keywords
Databases curricula; SQL; structured query language; Big Data
1. INTRODUCTION
The Structured Query Language (SQL) is the most extensively used
database language. SQL is composed of a data definition language
(DDL), which allows the specification of database schemas; a data
manipulation language (DML), which supports operations to
retrieve, store, modify and delete data; and a data control language
(DCL), which enables database administrators to configure security
access to databases. Among the most important reasons for SQL's
wide adoption are that: (1) it is primarily a declarative language,
that is, it specifies the logic of a program (what needs to be done)
instead of the control flow (how to do it); (2) it is relatively simple
to learn and understand because it is declarative and uses English
statements; (3) it is a standard of the American National Standards
Institute (ANSI) and the International Organization for Standardization (ISO).
SIGCSE '16, March 2-5, 2016, Memphis, Tennessee, USA.
Copyright 2016 ACM 1-58113-000-0/00/0010 $15.00.
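The DDL and DML components described in the introduction can be illustrated with a minimal, self-contained sqlite3 sketch. The table and values are illustrative, and SQLite has no DCL, so the GRANT/REVOKE side of SQL is omitted here:

```python
import sqlite3

# In-memory database; sqlite3 serves only as a convenient, zero-setup
# stand-in for any SQL-speaking system.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# DDL: specify a schema
cur.execute("CREATE TABLE student (id INTEGER PRIMARY KEY, name TEXT, gpa REAL)")

# DML: store, modify, delete, and retrieve data
cur.execute("INSERT INTO student VALUES (1, 'Ana', 3.8)")
cur.execute("INSERT INTO student VALUES (2, 'Ben', 3.1)")
cur.execute("UPDATE student SET gpa = 3.2 WHERE id = 2")
cur.execute("DELETE FROM student WHERE id = 1")
rows = cur.execute("SELECT name, gpa FROM student").fetchall()
print(rows)  # [('Ben', 3.2)]
```

The same statements, modulo minor dialect differences, run on the relational and Big Data systems discussed in the following sections.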
Commands and queries for the Hive module (weather station data):
C1: sudo hive
C2: CREATE DATABASE MStation;
C3: CREATE TABLE stationData (stationid int, zipcode int, latitude double, longitude double, stationname string);
C4: CREATE TABLE weatherReport (stationid int, temp double, humi double, precip double, year int, month string);
C5: LOAD DATA LOCAL INPATH '/home/cloudera/datastation/stationData.txt' INTO TABLE stationData;
C6: LOAD DATA LOCAL INPATH '/home/cloudera/datastation/weatherReport.txt' INTO TABLE weatherReport;
Q1: SELECT S.zipcode, AVG(W.precip) FROM stationData S JOIN weatherReport W ON S.stationid = W.stationid GROUP BY S.zipcode;
Q2: SELECT stationid, AVG(humi) FROM weatherReport GROUP BY stationid;
Q3: SELECT stationid, MIN(temp), MAX(temp) FROM weatherReport WHERE year > 2000 GROUP BY stationid;
Q4: SELECT S.zipcode, W.temp, W.month, W.year FROM stationData S JOIN weatherReport W ON S.stationid = W.stationid ORDER BY W.temp DESC LIMIT 10;
Setup commands for the Spark SQL module (Apple stock data):
C1: wget 'http://realchart.finance.yahoo.com/table.csv?s=AAPL&d=6&e=4&f=2015&g=d&a=11&b=12&c=1980&ignore=.csv'
C2: mv 'table.csv?s=AAPL' table.csv
C3: hadoop fs -put ./table.csv /data/
C4: spark-shell --master yarn-client --driver-memory 512m --executor-memory 512m
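The weather-station queries above are plain ANSI SQL and run nearly unchanged outside Hive. A sqlite3 sketch with made-up sample rows (the schemas follow the figure; the data values are invented for illustration):

```python
import sqlite3

# sqlite3 stand-in for the Hive session; the two tables mirror the
# stationData/weatherReport schemas from the module.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE stationData (stationid INT, zipcode INT, "
            "latitude REAL, longitude REAL, stationname TEXT)")
cur.execute("CREATE TABLE weatherReport (stationid INT, temp REAL, "
            "humi REAL, precip REAL, year INT, month TEXT)")
cur.executemany("INSERT INTO stationData VALUES (?,?,?,?,?)",
                [(1, 85304, 33.6, -112.2, 'Glendale'),
                 (2, 85304, 33.5, -112.1, 'Peoria')])
cur.executemany("INSERT INTO weatherReport VALUES (?,?,?,?,?,?)",
                [(1, 75.0, 0.30, 0.2, 2001, 'Jan'),
                 (2, 95.0, 0.10, 0.4, 2005, 'Jul')])

# Q1: average precipitation per zip code, same SQL as the Hive version
q1 = cur.execute("""SELECT S.zipcode, AVG(W.precip)
                    FROM stationData S JOIN weatherReport W
                      ON S.stationid = W.stationid
                    GROUP BY S.zipcode""").fetchall()
print(q1)  # one group for zip 85304 with average precip 0.3
```

This portability is exactly the point of the module: students write the query once and see it execute on a single-node engine and on a distributed one.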
C5: import org.apache.spark.sql._
C6: val base_data = sc.textFile("hdfs://sandbox.hortonworks.com:8020/data/table.csv")
C7: val attributes = base_data.first
C8: val data = base_data.filter(_(0) != attributes(0))
C9: case class AppleStockRecord(date: String, open: Float, high: Float, low: Float, close: Float, volume: Integer, adjClose: Float, year: Int)
C10: val applestock = data.map(_.split(",")).map(row => AppleStockRecord(row(0), row(1).trim.toFloat, row(2).trim.toFloat, row(3).trim.toFloat, row(4).trim.toFloat, row(5).trim.toInt, row(6).trim.toFloat, row(0).trim.substring(0, 4).toInt)).toDF()
C11: applestock.registerTempTable("applestock")
C12: applestock.show
C13: applestock.count
C14: output.map(t => "Record: " + t.toString).collect().foreach(println)
Q1: val output = sql("SELECT * FROM applestock WHERE close >= open")
Q2: val output = sql("SELECT MAX(close - open) FROM applestock")
Q3: val output = sql("SELECT date, high FROM applestock ORDER BY high DESC LIMIT 10")
Q4: val output = sql("SELECT year, AVG(volume) FROM applestock WHERE year > 1999 GROUP BY year")
Q1: Report the records where the close price is greater than or
equal to the open price.
Q2: Identify the record with the largest difference between the
close and the open prices.
Q3: Report the ten records with the highest high prices.
Q4: Report the average stock volume per year, considering only
years greater than 1999.
A key feature of Spark is that the output of SQL queries can be used
as the input of MapReduce-like computations. This is, in fact, done
in C14 to print the results of the SQL queries.
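The shape of this pipeline (load, drop the header, parse into records, query, then post-process the result) can be sketched in plain Python; the file contents are simulated with a small in-memory list, and the dict records stand in for the Scala case class:

```python
# Plain-Python sketch of the Spark pipeline (C6-C10, Q1, C14).
base_data = [
    "Date,Open,High,Low,Close,Volume,Adj Close",  # header row
    "2015-07-02,126.43,126.69,125.77,126.44,27399000,126.44",
    "2015-07-01,126.90,126.94,125.99,126.60,30238800,126.60",
]
attributes = base_data[0]

# C8: drop the header by comparing first characters, as the Scala code does
data = [line for line in base_data if line[0] != attributes[0]]

# C10: parse each CSV row into a record
applestock = []
for line in data:
    f = line.split(",")
    applestock.append({"date": f[0], "open": float(f[1]),
                       "close": float(f[4]), "year": int(f[0][:4])})

# Q1: records where the close price is >= the open price
output = [r for r in applestock if r["close"] >= r["open"]]

# C14: feed the query result into a map-style computation
for r in output:
    print("Record: " + r["date"])
```

The last step is the key idea: the relational result set becomes an ordinary collection that arbitrary MapReduce-style code can consume.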
C1: hbase shell
C2: create 'OccupationsData', 'code', 'description', 'total_emp', 'salary'
C3: exit
C4: hive
C5: CREATE EXTERNAL TABLE Occupations (key STRING, description STRING, total_emp int, salary int) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,description:description,total_emp:total_emp,salary:salary") TBLPROPERTIES ("hbase.table.name" = "OccupationsData");
C6: hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.separator=',' -Dimporttsv.columns=HBASE_ROW_KEY,description:description,total_emp:total_emp,salary:salary OccupationsData /home/cloudera/occupations.csv;
C7: impala-shell
C8: INVALIDATE METADATA [[default.]Occupations];
Q1: SELECT description, salary FROM Occupations ORDER BY salary DESC LIMIT 10;
Q2: SELECT description, salary/total_emp AS PerPersonSalary FROM Occupations ORDER BY PerPersonSalary DESC LIMIT 10;
Q3: SELECT SUM(salary)/SUM(total_emp) AS AvgSalary FROM Occupations WHERE description LIKE '%computer%';
Q4: SELECT code, description, salary FROM Occupations WHERE salary IN (SELECT MAX(salary) FROM Occupations);
Q4: Show the occupation that has the highest combined income.
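Q4's nested-subquery pattern is standard SQL and can be tried directly on a lightweight engine. A sqlite3 sketch with the same Occupations schema (the two rows are invented for illustration):

```python
import sqlite3

# sqlite3 stand-in for the Impala queries; sample rows are illustrative.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE Occupations (code TEXT, description TEXT, "
            "total_emp INT, salary INT)")
cur.executemany("INSERT INTO Occupations VALUES (?,?,?,?)",
                [('11-0000', 'Management occupations', 100, 9000000),
                 ('15-0000', 'Computer occupations', 200, 8000000)])

# Q4: occupation with the highest combined income, via a nested subquery
q4 = cur.execute("""SELECT code, description, salary FROM Occupations
                    WHERE salary IN (SELECT MAX(salary)
                                     FROM Occupations)""").fetchall()
print(q4)  # [('11-0000', 'Management occupations', 9000000)]
```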
C1: md \data\db
C2: C:\mongodb\bin\mongod.exe
C3: C:\mongodb\bin\mongo.exe
C4: show dbs
C5: C:\mongodb\bin\mongoimport.exe -h <host:port> -d <your database name> -c <collection> -u <dbuser> -p <dbpassword> --type csv --file <path to .csv file> --headerline
C6: mongodb://<dbuser>:<dbpassword>@<host:port>/<dbname>
Q1: SELECT N1 FROM "/project/taxes" WHERE ZIPCODE = 85304
Q2: SELECT N2, ZIPCODE FROM "/project/taxes" WHERE STATE = 'NY'
Q3: SELECT ZIPCODE, A00100 FROM "/project/taxes" WHERE STATE = 'NY' AND ZIPCODE != 0 ORDER BY A00100 DESC LIMIT 10
Q4: SELECT DISTINCT STATE, AVG(A00200) FROM "/project/taxes" WHERE ZIPCODE != 0 AND ZIPCODE != 99999 GROUP BY STATE
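Under the hood, a SQL layer over MongoDB rewrites SELECT/WHERE into the native find(filter, projection) call. The sketch below makes that mapping concrete without requiring a server: the taxes documents and the tiny `find` helper are invented stand-ins, not MongoDB's actual driver code.

```python
# Simulated documents from the taxes collection (values are made up).
taxes = [
    {"STATE": "AZ", "ZIPCODE": 85304, "N1": 5210, "N2": 9630, "A00100": 31},
    {"STATE": "NY", "ZIPCODE": 10001, "N1": 8120, "N2": 14020, "A00100": 95},
]

def find(docs, flt, projection):
    """Tiny emulation of MongoDB's collection.find(filter, projection)."""
    out = []
    for d in docs:
        # equality filter, the Mongo translation of a WHERE clause
        if all(d.get(k) == v for k, v in flt.items()):
            # projection, the Mongo translation of the SELECT list
            out.append({k: d[k] for k in projection})
    return out

# Q1: SELECT N1 FROM taxes WHERE ZIPCODE = 85304
# MongoDB shell form: db.taxes.find({"ZIPCODE": 85304}, {"N1": 1})
q1 = find(taxes, {"ZIPCODE": 85304}, ["N1"])
print(q1)  # [{'N1': 5210}]
```

Walking through this translation in class helps students see which SQL constructs (equality filters, projections) map directly onto document stores and which (joins, GROUP BY) require extra machinery.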
C1: CREATE TABLE event_data (utc_time TIMESTAMP NOT NULL, creative_id INTEGER NOT NULL, cost DECIMAL, campaign_id INTEGER NOT NULL, advertiser_id INTEGER NOT NULL, site_id INTEGER NOT NULL, is_impression INTEGER NOT NULL, is_clickthrough INTEGER NOT NULL, is_conversion INTEGER NOT NULL);
Q1: SELECT site_id, COUNT(*) AS records, SUM(is_impression) AS impressions, SUM(is_clickthrough) AS clicks, SUM(is_conversion) AS conversions, SUM(cost) AS cost FROM event_data GROUP BY site_id;
Q2: SELECT advertiser_id, COUNT(*) AS records, SUM(is_impression) AS impressions, SUM(is_clickthrough) AS clicks, SUM(is_conversion) AS conversions, SUM(cost) AS cost FROM event_data GROUP BY advertiser_id;
Q3: SELECT advertiser_id, TRUNCATE(DAY, utc_time) AS utc_day, COUNT(*) AS records, SUM(is_impression) AS impressions, SUM(is_clickthrough) AS clicks, SUM(is_conversion) AS conversions, SUM(cost) AS spent FROM event_data GROUP BY advertiser_id, TRUNCATE(DAY, utc_time);
Q4: SELECT creative_id AS ad, SUM(cost) AS cost FROM event_data GROUP BY creative_id ORDER BY cost DESC LIMIT 10;
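Except for the TRUNCATE(DAY, ...) date function in Q3, these aggregations are portable SQL. A sqlite3 sketch of Q1 with a handful of invented event rows (an ORDER BY is added only to make the output deterministic):

```python
import sqlite3

# sqlite3 stand-in for the NewSQL event_data table; rows are made up.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("""CREATE TABLE event_data (utc_time TIMESTAMP NOT NULL,
    creative_id INTEGER NOT NULL, cost DECIMAL, campaign_id INTEGER NOT NULL,
    advertiser_id INTEGER NOT NULL, site_id INTEGER NOT NULL,
    is_impression INTEGER NOT NULL, is_clickthrough INTEGER NOT NULL,
    is_conversion INTEGER NOT NULL)""")
cur.executemany("INSERT INTO event_data VALUES (?,?,?,?,?,?,?,?,?)",
    [('2015-01-01 00:00:01', 1, 0.5, 10, 100, 7, 1, 0, 0),
     ('2015-01-01 00:00:02', 1, 1.5, 10, 100, 7, 1, 1, 0),
     ('2015-01-01 00:00:03', 2, 2.0, 11, 101, 8, 1, 0, 1)])

# Q1: per-site totals of records, impressions, clicks, conversions, and cost
q1 = cur.execute("""SELECT site_id, COUNT(*), SUM(is_impression),
                           SUM(is_clickthrough), SUM(is_conversion), SUM(cost)
                    FROM event_data
                    GROUP BY site_id ORDER BY site_id""").fetchall()
print(q1)  # [(7, 2, 2, 1, 0, 2.0), (8, 1, 1, 0, 1, 2.0)]
```

Encoding the event flags as 0/1 integers is what lets plain SUM() compute the per-site counts without any conditional aggregation.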
6. DISCUSSION
The proposed modules can be integrated into several introductory
and advanced database-related courses. One approach is to adopt
Big Data management systems as a replacement for the traditional
relational databases currently used in most introductory database
courses. We recommend using a NewSQL system, e.g., VoltDB,
for this approach since it supports both the relational model and
SQL. An alternative approach for introductory database courses is
to complement the use of a traditional database with a Big Data
system. In this case, instructors can teach database fundamentals
using traditional databases and then show how the same or similar
concepts can be applied to modern Big Data systems. Another
approach is to integrate the study of SQL beyond traditional
databases as part of a course in Big Data. This course can study the
use of SQL in MapReduce, NoSQL and NewSQL systems, and
compare it with other native query languages. The authors have
previously integrated the study of NewSQL systems in a Big Data
course and are currently integrating the study of SQL in a Big Data
system as part of an undergraduate database course. We plan to
report the results of these experiences in future work.
7. CONCLUSIONS
Many application scenarios require processing very large datasets
in a highly scalable and distributed fashion, and different types of
Big Data systems have been designed to address this challenge.