Finally, the time spent alongside all these people has been
carved in our memories. They remain the example to
follow, and we hope that someday we will be able, in our
turn, to pass on as much as we were able to receive.
TABLE OF CONTENTS
Acknowledgments
Table of contents
Life before Business Intelligence
BI at a glance
INSEE Presentation
Phase 1: Identifying the project perimeter
    Project goals
    Project progress
Phase 2: Data Sourcing and ETL
    Data Understanding
    Data cleaning
    Data acquisition & ETL Process
    Pentaho's ETL tool
Phase 3: Designing the Data Warehouse
    Data warehousing
    Conceptual Data Model
    Physical Data Model
    The Data marts
LIFE BEFORE BI
In the beginning was the data, and the data was hidden
away somewhere deep in the bowels of the corporate
databases, where only an elite of highly trained users was
able to reach it.
When access to this data was needed, the only way to get
at it was to ask, or even beg, one of those highly trained
elite users for help (especially if the person asking for
the information was not a computer scientist). But when
the query finally made its way to the top of Mr. Elite
User's in-tray, often several months later, the information
that trickled down, in the form of a spreadsheet or even a
printed report, would be horrendously out of date.
BI AT A GLANCE
Many vague terms are tossed around to define
Business Intelligence. To one business person, it means
market research, something we would call "competitive
intelligence". To another, "reporting" may be a
better term, even though business intelligence goes well
beyond accessing a static report. "Reporting and
analysis" is also frequently used to describe business
intelligence. Others use terms such as "business
analytics" or "decision support", both with varying
degrees of appropriateness.
[...] to compare market shares for different [...]
Performance management: [...] performance indicators and goals using [...]
INSEE PRESENTATION
France's National Institute of Statistics and Economic
Studies (Institut National de la Statistique et des Études
Économiques: INSEE) is a Directorate General of the
Ministry of the Economy, Finance, and Employment. It is
therefore a government agency whose personnel are
government employees, although not all belong to the
civil service. INSEE operates under government accounting
rules: it receives its funding from the State's general
budget.
Getting to know INSEE
Main goal and missions, legislative framework, INSEE in
the European statistical system, brief history, INSEE
resources, working at INSEE.
Official Statistics
The official statistical system collects the data needed to
compile quantitative results. In this capacity, it undertakes
PHASE 1:
ACHIEVEMENT CONTEXT
PROJECT PROGRESS
As the first and most important step in our BI project,
we started by identifying the project perimeter: we
analyzed and tried to understand all the data in the given
spreadsheets, then set the goals and ultimate objectives,
relying on every specified note.
After identifying the project context and boundaries, we
cleaned and filtered the data using the appropriate ETL
tools. ETL, which stands for Extract, Transform and Load,
is a set of functions combined into one solution that
extracts data from numerous databases, sources,
applications and systems, transforms it as appropriate, and
loads it into another database, a data mart or a data
warehouse for analysis, or sends it along to another
operational system to support a business process. Creating
a Data Warehouse was the next phase: we kept in mind that
a DW is most likely to succeed if it is highly organized
and flexible.
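The extract, transform and load steps described above can be illustrated with a minimal sketch. The project itself relied on a dedicated ETL tool; this example swaps in plain Python with the standard `csv` and `sqlite3` modules purely for illustration. The column names, the table name `mariages` and every value in `RAW_CSV` are made-up assumptions, not data from the project.

```python
import csv
import io
import sqlite3

# Hypothetical extract of a source spreadsheet saved as CSV (made-up values).
RAW_CSV = """nom_departement;annee;nbr_mariages
Ain;2008; 2 410
Aisne;2008;2105
Aisne;2008;2105
Allier;2008;
"""

def extract(text):
    """Extract: read the semicolon-separated rows into dictionaries."""
    return list(csv.DictReader(io.StringIO(text), delimiter=";"))

def transform(rows):
    """Transform: trim text, parse numbers, drop duplicates and incomplete rows."""
    seen, clean = set(), []
    for row in rows:
        raw_count = row["nbr_mariages"].replace(" ", "")
        if not raw_count:
            continue  # incomplete row: there is no measure to load
        record = (row["nom_departement"].strip(), int(row["annee"]), int(raw_count))
        if record not in seen:
            seen.add(record)
            clean.append(record)
    return clean

def load(records, conn):
    """Load: write the cleaned records into a target table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS mariages "
        "(nom_departement TEXT, annee INTEGER, nbr_mariages INTEGER)"
    )
    conn.executemany("INSERT INTO mariages VALUES (?, ?, ?)", records)
    conn.commit()

conn = sqlite3.connect(":memory:")  # stand-in for the data warehouse
load(transform(extract(RAW_CSV)), conn)
```

Keeping the three steps as separate functions means each one can be checked on its own before the data reaches the warehouse.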
PHASE 2:
DATA SOURCING & ETL
DATA UNDERSTANDING
DATA CLEANING
Data understanding is not an obligatory step, but it is
useful in many respects. The main role of data surveying at
this stage is to find out, from the general structure of
the data, whether the extracted or given data contains a
useful amount of information. This led us to the data
cleaning phase. Basic as it is, its purpose is to obtain
healthy data that can improve the final modeling results.
It included checking the consistency of individual
attribute values, types and quantities, removing
redundancy, and finding outliers. We detected a few
anomalies: slight differences between the number of
recruits and the number of candidates accepted in internal
contests, especially when there is no free intake test
(concours HF file).
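The two checks mentioned above, comparing recruits with accepted candidates and looking for outliers, could be sketched as follows. This is only an illustration under stated assumptions: the rows in `contests`, the sample values and the choice of Tukey's interquartile-range rule are hypothetical and are not taken from the concours files.

```python
from statistics import quantiles

# Hypothetical rows: (contest type, candidates accepted, candidates recruited).
contests = [
    ("interne", 120, 120),
    ("interne", 95, 93),
    ("externe", 200, 200),
    ("interne", 60, 61),
]

def inconsistent_rows(rows):
    """Return the rows where the recruited and accepted counts disagree."""
    return [row for row in rows if row[1] != row[2]]

def iqr_outliers(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's rule)."""
    q1, _, q3 = quantiles(values, n=4)
    low, high = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    return [v for v in values if v < low or v > high]

anomalies = inconsistent_rows(contests)
outliers = iqr_outliers([10, 12, 11, 13, 12, 11, 95])
```

Flagged rows are candidates for manual review rather than automatic deletion, since a small gap between the two counts may be legitimate.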
Checking in this phase deals with the completeness and
correctness of the data. Completeness describes the
proportion and regularity of missing values in the data.
Correctness relates to the discovery of erroneous values
present in the data, their extent and possible remedies.
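As a rough illustration of these two notions, the sketch below measures completeness as the share of non-missing values per attribute and checks correctness against a simple validity rule. The records, the attribute names and the rule (counts must not be negative) are assumptions made for the example.

```python
# Hypothetical records; None marks a missing value.
records = [
    {"nom_departement": "Ain", "nbr_mariages": 2410, "nbr_naissances": 7300},
    {"nom_departement": "Aisne", "nbr_mariages": None, "nbr_naissances": 6900},
    {"nom_departement": "Allier", "nbr_mariages": 1300, "nbr_naissances": -5},
    {"nom_departement": None, "nbr_mariages": 900, "nbr_naissances": None},
]

def completeness(rows):
    """Proportion of non-missing values per attribute (1.0 means no gaps)."""
    columns = rows[0].keys()
    return {c: sum(r[c] is not None for r in rows) / len(rows) for c in columns}

def incorrect_values(rows, column, is_valid):
    """Return (row index, value) pairs that are present but fail the rule."""
    return [
        (i, r[column])
        for i, r in enumerate(rows)
        if r[column] is not None and not is_valid(r[column])
    ]

report = completeness(records)
bad_births = incorrect_values(records, "nbr_naissances", lambda v: v >= 0)
```

Missing values are deliberately excluded from the correctness check, so each problem is counted once and under the right heading.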
PHASE 3:
DESIGNING THE DATA
WAREHOUSE
DATA WAREHOUSING
In [...] change [...] cases where [...] discontinuously [...] external or business organizations [...]
Entities List:
Marriage
Canton
Arrondissement
Departement
Territoire
Activite
Région
Etat
Bibliothèque
Population
Commune
Categorie_concours
Naissances Et Décès
Type_concours
Sous_Activité
Concours
Type_recrutement
Categorie_SP
Equipement
Sexe
Bac
Classe_sup
Branche
Categorie_classe_sup
Sample of Associations:
concern
pourcentage
Belong to
Has
a pr etablissement
Categorie_SP
s'applique
s'adresse
A pour pib_region
intéresse
appartient
DATA MARTS
To design our data marts, we first had to define our
dimensions and our fact tables.
We started by [...]
[...] Classes shows the percentage of the used [...]
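The relationship between dimensions and a fact table can be sketched with a small star schema. The table names (`dim_departement`, `dim_annee`, `fait_demographie`), their columns and the inserted values are hypothetical and chosen only for illustration; they are not the project's actual physical model. SQLite is used here simply because it ships with Python.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimensions: descriptive axes used to slice the measures.
CREATE TABLE dim_departement (
    id_departement INTEGER PRIMARY KEY,
    nom_departement TEXT NOT NULL
);
CREATE TABLE dim_annee (
    id_annee INTEGER PRIMARY KEY,
    annee INTEGER NOT NULL
);
-- Fact table: one row per department and year, holding keys and measures only.
CREATE TABLE fait_demographie (
    id_departement INTEGER REFERENCES dim_departement(id_departement),
    id_annee INTEGER REFERENCES dim_annee(id_annee),
    nbr_mariages INTEGER,
    nbr_naissances INTEGER
);
""")
conn.executemany("INSERT INTO dim_departement VALUES (?, ?)", [(1, "Ain"), (2, "Aisne")])
conn.executemany("INSERT INTO dim_annee VALUES (?, ?)", [(1, 2008), (2, 2009)])
conn.executemany(
    "INSERT INTO fait_demographie VALUES (?, ?, ?, ?)",
    [(1, 1, 100, 300), (1, 2, 110, 310), (2, 1, 90, 250)],  # made-up measures
)

# A typical data mart query: aggregate a measure along one dimension.
query = """
SELECT d.nom_departement, SUM(f.nbr_mariages)
FROM fait_demographie AS f
JOIN dim_departement AS d ON d.id_departement = f.id_departement
GROUP BY d.nom_departement
ORDER BY d.nom_departement
"""
totals = conn.execute(query).fetchall()
```

The fact table stays narrow, while everything descriptive lives in the dimensions, which is what makes such aggregations cheap to write.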
PHASE 4:
OPERATING & DISSECTING
THE DM
PHASE 5:
DATA MINING PHASE
ANALYSING DATA
To analyze the data, we chose to work with the WEKA
tool. Its advantage is that it is written in Java and is
therefore relatively fast. Moreover, it is very reliable.
It provides a large set of classification algorithms and
search functions. It also offers a wide range of options
for producing graphs.
TEST METHODS
We were interested in decision trees and the K-Means
method. We started with decision trees, applying the J48
algorithm (WEKA's implementation of Quinlan's C4.5
algorithm).
Decision trees:
The figure below shows a decision tree grouping similar
departments in terms of births and deaths.
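C4.5, on which J48 is based, chooses the attribute to split on using the gain ratio, that is, the information gain divided by the split information. The sketch below computes it for nominal attributes only. It is not WEKA's code and leaves out numeric thresholds, pruning and the other refinements of the full algorithm. The toy attributes `taille` and `region` and the class labels are invented for the example.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def gain_ratio(values, labels):
    """Information gain of splitting on `values`, divided by the split information."""
    total = len(labels)
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    gain = entropy(labels) - remainder
    split_info = entropy(values)
    return gain / split_info if split_info > 0 else 0.0

# Hypothetical toy data: one attribute separates the classes, the other does not.
taille = ["grand", "grand", "petit", "petit"]
region = ["nord", "sud", "nord", "sud"]
classe = ["eleve", "eleve", "faible", "faible"]

ratio_taille = gain_ratio(taille, classe)
ratio_region = gain_ratio(region, classe)
```

Here `taille` would be picked for the first split, since it separates the two classes completely while `region` carries no information about them.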
B- K-Means
K=2
kMeans
======
Number of iterations: 7
Within cluster sum of squared errors: 94.93524745601803
Missing values globally replaced with mean/mode
Cluster centroids:
Cluster      Instances   pib                  nom_departement
Full Data    (96)        17670552083.3333     Ain
0            (78)        10561371794.8718     Aisne
1            (18)        48477000000          Ain
Clustered Instances
0    78 ( 81%)
1    18 ( 19%)
K=3
kMeans
======
Number of iterations: 14
Within cluster sum of squared errors: 93.52052880969502
Missing values globally replaced with mean/mode
Cluster centroids:
Cluster      Instances   pib                  nom_departement
Full Data    (96)        17670552083.3333     Ain
0            (72)        9343222222.2222      Aisne
1            (21)        34178333333.3333     Ain
2            (3)         101972000000         Alpes-Maritimes
Clustered Instances
0    72 ( 75%)
1    21 ( 22%)
2     3 (  3%)
K=4
kMeans
======
Number of iterations: 14
Within cluster sum of squared errors: 93.52052880969502
Missing values globally replaced with mean/mode
Cluster centroids:
Cluster      Instances   pib                  nom_departement
Full Data    (96)        17670552083.3333     Ain
0            (72)        9343222222.2222      Aisne
1            (21)        34178333333.3333     Ain
2            (3)         101972000000         Alpes-Maritimes
Clustered Instances
0    72 ( 75%)
1    21 ( 22%)
2     3 (  3%)
K=6
kMeans
======
Number of iterations: 11
Within cluster sum of squared errors: 90.27658195695966
Missing values globally replaced with mean/mode
Cluster centroids:
Cluster      Instances   pib                  nom_departement
Full Data    (96)        17670552083.3333     Ain
0            (25)        7723840000           Aisne
1            (27)        14707333333.3333     Ain
2            (7)         43717571428.5714     Alpes-Maritimes
3            (20)        3708000000           Alpes-de-Haute-Provence
4            (3)         110004333333.3333    Bouches-du-Rhône
5            (14)        28284500000          Finistère
Clustered Instances
0    25 ( 26%)
1    27 ( 28%)
2     7 (  7%)
3    20 ( 21%)
4     3 (  3%)
5    14 ( 15%)
K=7
kMeans
======
Number of iterations: 12
Within cluster sum of squared errors: 89.27102393425096
Missing values globally replaced with mean/mode
Cluster centroids:
Cluster      Instances   pib                  nom_departement
Full Data    (96)        17670552083.3333     Ain
0            (23)        7465304347.8261      Allier
1            (18)        12792333333.3333     Ain
2            (7)         43717571428.5714     Alpes-Maritimes
3            (20)        3708000000           Alpes-de-Haute-Provence
4            (3)         110004333333.3333    Bouches-du-Rhône
5            (11)        29749727272.7273     Finistère
6            (14)        18354714285.7143     Calvados
Clustered Instances
0    23 ( 24%)
1    18 ( 19%)
2     7 (  7%)
3    20 ( 21%)
4     3 (  3%)
5    11 ( 11%)
6    14 ( 15%)
And K=10
kMeans
======
Number of iterations: 10
Within cluster sum of squared errors: 86.26315726326193
Missing values globally replaced with mean/mode
Cluster centroids:
Cluster      Instances   pib                  nom_departement
Full Data    (96)        17670552083.3333     Ain
0            (16)        6707000000           Allier
1            (8)         13605250000          Ain
2            (7)         43717571428.5714     Alpes-Maritimes
3            (19)        3614684210.5263      Alpes-de-Haute-Provence
4            (3)         110004333333.3333    Bouches-du-Rhône
5            (9)         31099777777.7778     Haute-Garonne
6            (12)        16968916666.6667     Aisne
7            (9)         11780333333.3333     Calvados
8            (5)         23217000000          Finistère
9            (8)         8733875000           Charente
Clustered Instances
0    16 ( 17%)
1     8 (  8%)
2     7 (  7%)
3    19 ( 20%)
4     3 (  3%)
5     9 (  9%)
6    12 ( 13%)
7     9 (  9%)
8     5 (  5%)
9     8 (  8%)
Marriages
K=2
kMeans
======
Number of iterations: 6
Within cluster sum of squared errors: 204.48640349553273
Missing values globally replaced with mean/mode
Cluster centroids:
Cluster      Instances   nbrdecesdomicillie   nbrnaissancesvivantedomicillie   nbrmariages   nom_departement
Full Data    (200)       5273.02              8232.795                         2738.765      Val-de-Marne
0            (150)       3728.2133            4988.06                          1862.4467     Val-d'Oise
1            (50)        9907.44              17967                            5367.72       Val-d'Oise
Clustered Instances
0    150 ( 75%)
1     50 ( 25%)
K=4
kMeans
======
Number of iterations: 12
Within cluster sum of squared errors: 196.85641391574097
Missing values globally replaced with mean/mode
Cluster centroids:
Cluster      Instances   nbrdecesdomicillie   nbrnaissancesvivantedomicillie   nbrmariages   nom_departement
Full Data    (200)       5273.02              8232.795                         2738.765      Val-d'Oise
0            (38)        8254.8421            13968.2105                       5206.1579     Val-d'Oise
1            (16)        12836.3125           26404.75                         5695.125      Val-de-Marne
2            (71)        4866.5493            6736.1127                        2441.3239     Marne
3            (75)        2533.52              2867.0267                        1139.5067     Haute-Marne
Clustered Instances
0    38 ( 19%)
1    16 (  8%)
2    71 ( 36%)
3    75 ( 38%)
...
K=10
kMeans
======
Number of iterations: 4
Within cluster sum of squared errors: 183.90874037493157
Missing values globally replaced with mean/mode
Cluster centroids:
Cluster      Instances   nbrdecesdomicillie   nbrnaissancesvivantedomicillie   nbrmariages   nom_departement
Full Data    (200)       5273.02              8232.795                         2738.765      Val-d'Oise
0            (16)        6911.3125            10702.6875                       2008.9375     Meuse
1            (10)        15436.7              27517.5                          5640.6        Paris
2            (8)         4210.75              5244.625                         5096.875      Oise
3            (41)        5234.1951            6986.2195                        2694.6341     Aisne
4            (12)        3065.8333            2728.4167                        985.8333      Indre
5            (34)        8589.7059            16988.5294                       5501.7353     Val-d'Oise
6            (4)         1470                 4585.75                          813           Tarn-et-Garonne
7            (16)        2498.6875            3052.625                         1220.125      Aube
8            (22)        1555.8636            1468                             715.9091      Haute-Marne
9            (37)        3579.4595            4376.1351                        1906.3784     Marne
Clustered Instances
0    16 (  8%)
1    10 (  5%)
2     8 (  4%)
3    41 ( 21%)
4    12 (  6%)
5    34 ( 17%)
6     4 (  2%)
7    16 (  8%)
8    22 ( 11%)
9    37 ( 19%)
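To make the quantities reported in these runs more concrete (iterations, centroids, within-cluster sum of squared errors), here is a minimal sketch of the basic k-means procedure, usually called Lloyd's algorithm. It is not WEKA's implementation: WEKA also handles nominal attributes such as nom_departement and applies its own preprocessing, so the figures above would not be reproduced by this code. The sample points and starting centroids are invented.

```python
def sq_dist(p, c):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, c))

def assign(points, centroids):
    """Put every point in the cluster of its nearest centroid."""
    clusters = [[] for _ in centroids]
    for p in points:
        nearest = min(range(len(centroids)), key=lambda i: sq_dist(p, centroids[i]))
        clusters[nearest].append(p)
    return clusters

def kmeans(points, centroids, max_iter=100):
    """Alternate assignment and centroid update until the centroids stop moving."""
    for _ in range(max_iter):
        clusters = assign(points, centroids)
        updated = [
            tuple(sum(dim) / len(cluster) for dim in zip(*cluster)) if cluster else c
            for cluster, c in zip(clusters, centroids)
        ]
        if updated == centroids:
            break
        centroids = updated
    clusters = assign(points, centroids)
    # Within-cluster sum of squared errors: distance of each point to its centroid.
    sse = sum(sq_dist(p, c) for cluster, c in zip(clusters, centroids) for p in cluster)
    return centroids, clusters, sse

# Hypothetical two-dimensional points forming two obvious groups.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
centroids, clusters, sse = kmeans(points, [(1.0, 1.0), (9.0, 8.0)])
```

Running the same function for several values of k and comparing the resulting errors mirrors the series of runs listed above.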
CONCLUSION