
School of Graduate Studies,

St. Mary's University


July 2018
¡ Data science, also known as data-driven science, is an interdisciplinary field
of scientific methods, processes, algorithms and systems used
to extract knowledge or insights from data in various
forms, either structured or unstructured.
§ It solves analytically complex problems using data.
¡ It employs different techniques and theories drawn from
§ Mathematics,
§ Statistics,
§ Information and Computer Science
¡ It spans sub-domains such as machine learning, classification, cluster
analysis, uncertainty quantification, computational
science, data mining, databases, and visualization.
¡ Data are raw facts and figures that on their own have no
meaning (e.g. readings from sensors, survey facts, etc.).
¡ Data can be numbers, words, letters, images, sound, etc.
Yes, Yes, No, Yes, No, Yes, No, Yes
42, 63, 96, 74, 56, 86
111192, 111234

¡ None of the above data have any meaning until they
are given a CONTEXT and PROCESSED into a
usable form.
§ Thus we need to process data into information to
make it meaningful and important.
¡ To achieve its aims the organisation will need to process
data into information.
¡ Data needs to be turned into meaningful information and
presented in its most useful format
¡ Data must be processed in a context in order to give it
meaning.
[Figure: temperature (Celsius, 36 to 40) plotted against time (08:00 to 10:30)]
¡ To turn data into information it needs to be processed.
Data -> Processing -> Information
¡ Information is data that has been processed by a
computer system to give it meaning.
¡ Processed can mean:
§ Having calculations performed on it
§ Converted to give it meaning
§ Organized in some way
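The three kinds of processing listed above can be sketched in a few lines of Python. This is only an illustration: the raw values are the ones shown earlier in the slides, but the contexts (replies to a yes/no survey question, test scores out of 100) and the function names are assumptions made for the example.

```python
from collections import Counter
from statistics import mean

# Raw data from the slides; on their own these values carry no meaning.
answers = ["Yes", "Yes", "No", "Yes", "No", "Yes", "No", "Yes"]
numbers = [42, 63, 96, 74, 56, 86]

# Assumed context (not stated in the slides): the answers are replies to a
# yes/no survey question, and the numbers are test scores out of 100.


def summarize_answers(values):
    """Organize the data: count how often each reply occurs."""
    return dict(Counter(values))


def to_percentages(counts):
    """Convert the counts into percentages of all replies."""
    total = sum(counts.values())
    return {reply: 100 * n / total for reply, n in counts.items()}


def average_score(scores):
    """Perform a calculation: the arithmetic mean of the scores."""
    return mean(scores)


counts = summarize_answers(answers)
print(counts)                  # {'Yes': 5, 'No': 3}
print(to_percentages(counts))  # {'Yes': 62.5, 'No': 37.5}
print(average_score(numbers))  # 69.5
```

With a different context the same raw values would yield different information, which is why the context has to be fixed before processing.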
Raw Data: Yes, Yes, No, Yes, No, Yes, No, Yes, No, Yes, Yes
Context: ????
Processing
Information: ????

Raw Data: 35.8, 36.2, 37.0, 38.4, 37.1, 35.8, 36.2, 37.0, 38.4, 37.0, 38.4, 37.1
Context: ??????????
Processing
Information: ??????????

Raw Data: 091017, 111618
Context: ????
Processing
Information: ????
Data vs. Information
Meaning
§ Data: Data is raw, unorganized facts that need to be processed. Data can be something simple and seemingly random and useless until it is organized.
§ Information: When data is processed, organized, structured or presented in a given context so as to make it useful, it is called information.
Example
§ Data: Each student's test score is one piece of data.
§ Information: The average score of a class or of the entire student body is information that can be derived from the given raw data.
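The test-score example can be made concrete with a short sketch. The student names, classes and scores below are hypothetical, invented purely for illustration; each record is one piece of data, and the averages are the derived information.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records, invented for illustration: (student, class, score).
records = [
    ("Abel", "A", 72),
    ("Sara", "A", 88),
    ("Hana", "B", 65),
    ("Dawit", "B", 91),
    ("Meron", "B", 78),
]


def class_averages(rows):
    """Group individual scores (data) by class and average them (information)."""
    by_class = defaultdict(list)
    for _student, class_name, score in rows:
        by_class[class_name].append(score)
    return {name: mean(scores) for name, scores in by_class.items()}


def overall_average(rows):
    """Average score across every student in the records."""
    return mean([score for _student, _class_name, score in rows])


print(class_averages(records))   # averages per class: A is 80, B is 78
print(overall_average(records))  # 78.8
```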
¡ As the amount of data generated by the typical modern business increases, so does
the prominence of data scientists hired by organizations to help them turn raw
data into valuable business information.
¡ Data extraction is the act of retrieving specific data from unstructured or poorly
structured data sources for further processing and investigation.
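As a rough sketch of data extraction, the snippet below pulls time and temperature pairs out of free-text notes with a regular expression. The notes, the pattern and the function name are hypothetical, chosen only to illustrate the idea; real sources usually need more robust parsing.

```python
import re

# Hypothetical free-text notes, invented for illustration.
notes = """
08:00 nurse recorded temp 35.8 C, patient resting
08:30 temp 36.2 C after breakfast
09:00 patient reports headache, temp 37.0 C
no reading taken at 09:30
"""

# A HH:MM time, any text on the same line, then 'temp <number> C'.
READING = re.compile(r"(\d{2}:\d{2}).*?temp (\d+\.\d+) C")


def extract_readings(text):
    """Return (time, temperature) pairs found in the text, one per line at most."""
    readings = []
    for line in text.splitlines():
        match = READING.search(line)
        if match:
            readings.append((match.group(1), float(match.group(2))))
    return readings


print(extract_readings(notes))
# [('08:00', 35.8), ('08:30', 36.2), ('09:00', 37.0)]
```

The last note mentions a time but no temperature, so it is skipped; the structured pairs that remain are ready for further processing.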

¡ Data-driven decisions are more profitable. Every minute,
§ Americans use 2,657,700 GB of data,
§ Instagram users post 46,750 photos,
§ 15,220,700 texts are sent in the form of Email/SMS, and
§ Google conducts 3,607,080 searches.
¡ Machine learning is changing business worldwide through better forecasting.
¡ Data scientists must possess a combination of analytic, machine learning, data
mining and statistical skills, as well as experience with algorithms and coding.
¡ Mathematical and statistical knowledge
enables a data scientist to view data through a quantitative
lens. There are textures, dimensions, and
correlations in data that can be expressed
mathematically.
¡ Technology and hacking skills are required because
data scientists use technology to
wrangle enormous data sets and work with
complex algorithms, which requires tools far
more sophisticated than Excel.
§ Data scientists need to be able to code
quick prototype solutions, as well as integrate
with complex data systems through different
programs.
¡ Domain expertise is also important: a data
scientist acts as a tactical business consultant who
works closely with data.
¡ Data science is the study of where information comes from,
what it represents and how it can be turned into a
valuable resource in the creation of business and IT
strategies.
§ Mining large amounts of structured and unstructured data
to identify patterns can help an organization rein in costs,
increase efficiencies, recognize new market opportunities
and increase the organization's competitive advantage.
¡ Along with managing and interpreting large amounts of
data, many data scientists are also tasked with creating
data visualization models that help illustrate the
business value of digital information.
¡ Data scientists draw the digital information they
are studying from a growing list of channels and
sources, including
§ Smartphones,
§ Internet of things (IoT) devices,
§ Social media,
§ Surveys,
§ Purchases,
§ Internet searches and behavior
¡ By sorting through these large data sets, data scientists
can identify patterns to solve problems through the
analysis of big data.
¡ The base for big data and data science is data.
¡ Data is first created and saved from various sources (sensors
on machines, user behavior on websites, applications and
computers, and many more), then archived and finally
analyzed to answer specific questions, to find patterns or to
reveal particular constellations.
¡ Data will be a golden asset for a company in the future, so
it is very important to save and archive it now. It is
worthless to say later that we could have had all the
data,
§ e.g. for transactions, customer behavior, machine processes and
application logs.
¡ Creation of almost all information in digital form
§ Datafication
¡ Dramatic cost reduction in storage
§ You can afford to keep all the data
¡ Dramatic increases in network bandwidth
§ You can move the data to where it is needed
¡ Dramatic cost reduction and scalability improvements in
computation
¡ Dramatic algorithmic breakthroughs
§ Machine Learning, Data Mining, Fundamental advances
in CS and Statistics
¡ Ever more powerful models producing ever increasing
volumes of data that must be analyzed
¡ The exponential growth in the volume and speed of data
has introduced several challenges:
§ System management and growing cluster complexity
§ Data center power, cooling, and floor space limitations
§ Storage, data movement, and management complexity
§ Lack of support for heterogeneous environments and
accelerators
§ Significant shortage of skills to integrate and manage the
big data ecosystem
¡ Datafication is a technological trend that turns many aspects of our lives
into computerized data and uses processes to transform that data into
new forms of value.
¡ Datafication is used in:
§ Insurance: Update risk profile development and business models.
§ Banking: Establish trustworthiness and the likelihood of a person
paying back a loan.
§ Human resources: Identify employees' risk-taking profiles.
§ Hiring and recruitment: Replace personality tests.

¡ By sorting through these large data sets, data scientists can
identify patterns to solve problems through the analysis of
big data.
¡ Data-driven problem-solving skills (domain expertise)
¡ Mathematical and statistical knowledge
¡ Technical skills required to become a data scientist
include:
§ Programming: you need knowledge of
programming languages such as Python, Perl, C/C++,
SQL, R and Java.
§ Python and R are the most common coding languages
required in data science roles.
¡ In general, a data scientist is required to have skills in:
§ Computer science, math, statistics, machine learning,
domain expertise, communication and presentation, and
data visualization
¡ Predicting iceberg paths: this occasionally requires icebergs to be towed to
avoid collisions
¡ Oil well drilling optimization: how to dig as few test wells as possible to
detect the entire area where oil can be found
¡ Predicting solar flares: timing, duration, intensity and localization
¡ Predicting Earthquakes
¡ Predicting very local weather (short-term) or global weather (long-term);
reconstructing past weather (such as 200 million years ago)
¡ Predicting weather on Mars to identify best time and spots for a landing
¡ Predicting riots based on tweets
¡ Designing metrics to predict student success, or employee attrition
¡ Predicting book sales, determining correct price, price elasticity and
whether a specific book should be accepted or rejected by a publisher,
based on projected ROI
¡ Predicting volcano risk, to evacuate populations or cancel flights, while
minimizing expenses caused by these decisions
¡ Predicting 500-year floods, to build dams
¡ Predicting death and health expenditures, to compute insurance premiums
(based on which population segment you belong to)
¡ Predicting reproduction rate in animal populations
¡ Predicting food reserves each year (fish, meat, crops including crop
failures caused by diseases or other problems). Same with
electricity and water consumption, as well as rare metals or
elements that are critical to build computers and other modern
products.
¡ Predicting longevity of a product, or a customer
¡ Predicting duration, extent and severity of droughts or fires
¡ Predicting racial and religious mix in a population, detecting change
point to adapt policies accordingly
¡ Predicting new flu viruses to design efficient vaccines each year
¡ Road constructions and traffic lights designed to optimize highway
traffic.
¡ Google algorithm to predict duration of a road trip, doing much
better than GPS systems not connected to the Internet.
¡ Spell checks, especially for people writing in multiple languages
¡ Distinguishing between noise and signal on millions of pictures or
videos, to identify patterns
¡ Automated piloting (drones, cars without pilots)
¡ Customized, patient-specific medications and diets
¡ Predicting and legally manipulating elections
¡ Sports betting
¡ Predicting oil demand, oil reserves, oil price, impact of coal usage
¡ Predicting chances that a container in a port contains a nuclear
bomb
¡ Assessing the probability that a convict is really the culprit,
especially when a chain of events resulted in a crime or accident
¡ Computing correct average time-to-crime statistics for an average
gun (using censored models to compensate for the bias caused by
new guns not having a criminal history attached to them)
