
ACKNOWLEDGEMENTS

Silent gratitude isn't much use to anyone.


[Gladys Bronwyn Stern]

Our first words of recognition and gratitude are owed to


Mr. Kamel SMAILI,

for doing us the honor of being our beloved teacher and supervisor.
He responded to our numerous requests, and helped us with his
professionalism and experience to accomplish a better job. He managed
to make us appreciate the "beauty" of Business Intelligence, and
through his marvelous way of teaching, we could approach the
fascinating worlds of Data Warehousing and Data Mining.

We would also like to express our appreciation to all the students of
this master; we have been so fortunate to be in such an impeccable
atmosphere. Thank you for your unending support, your devotion, and
for making this year so rewarding and enjoyable.

Many thanks to Mr. Abdelatif Bouhlal, who is no longer teaching us,
but from whom we have learned so much regarding the art of eloquence.
We are deeply grateful to you, sir, and we will truly never forget you.

Finally, the time spent alongside all these people has been carved
into our memories; they remain the example to follow, and we hope
that someday we will be able to pass on, in our turn, as much as we
were able to receive.

TABLE OF CONTENTS
Acknowledgments
Table of contents
Life before Business Intelligence
BI at a glance
INSEE Presentation
Phase 1: Identifying the project perimeter
    Project goals
    Project progress
Phase 2: Data Sourcing and ETL
    Data understanding
    Data cleaning
    Data acquisition & ETL process
    Pentaho's ETL tool
Phase 3: Conceiving the Data Warehouse
    Data warehousing
    Conceptual Data Model
    Physical Data Model
    The Data marts
    Tools used
Phase 4: Operating and dissecting the Data Marts
    Data format & Calculated fields
    Reports generation
Phase 5: Data Mining Phase
    Data Mining Presentation
    Analysing Data

LIFE BEFORE BI

In the beginning was the data, and the data was hidden
away somewhere deep in the bowels of the corporate
databases, where only an elite of highly trained users were
able to reach it.

When access to this data was needed, the only way to get at it was to
ask, or even beg, one of those highly trained elite users for help
(especially if the person asking for the information isn't a computer
scientist). But when the query finally made its way to the top of
Mr. Elite User's in-tray, often several months later, the information
that trickled down, in the form of a spreadsheet or even a printed
report, would be horrendously out of date.

And there was little guarantee that Mr. Elite User would understand
the business requirements in the first place, and so avoid supplying
wrong (or at best irrelevant) information.

Business intelligence is the solution to this hideous problem: not
only does it provide easy access to business data through its
architecture and its collection of integrated operational and
decision-support applications, it also improves the ability to study
past behaviors and actions in order to better understand where the
organization or the company stands.

Put simply, BI lets you make better business decisions


because it gives you access to the right information at the
right time.

BI AT A GLANCE
A lot of vague terms are tossed around to define Business
Intelligence: to one business person, it means market research,
something we would call "competitive intelligence." To another
person, "reporting" may be a better term, even though business
intelligence goes well beyond accessing a static report. "Reporting"
and "analysis" are terms frequently used to describe business
intelligence too. Others will use terms such as "business analytics"
or "decision support," both with varying degrees of appropriateness.

How these terms differ matters little, unless you are trying to
compare market shares for different technologies. What matters most
is to use the terminology


that is most familiar to intended users and that has a
positive connotation. No matter which terminology you
use, keep the ultimate value of business intelligence in mind, which
is providing pertinent insight so you can measure performance and
take action while it is still possible, in order to eventually reach
your goals. Best of all, it lets you do it all yourself, rather than
having to depend on IT professionals to provide you with the data you
need at a time that suits their schedule; it also allows you to
track, understand and manage your business, and offers several other
options, such as:
Reporting: Reporting, as its name suggests, enables
you to format and deliver information to large
audiences both inside and outside your organization
in the form of reports.

Query and analysis: Query and analysis tools provide


you with a means of interacting with business
information (by performing your own ad hoc queries)
without having to understand the often complex
data that lies underneath this information.

Performance management: Performance management tools let you keep
track of and analyze key performance indicators and goals using
dashboards, scorecards, and analytics.


What Business Intelligence Is not:

BI is neither a product nor a system.


A data warehouse may or may not be a component
of your business intelligence architecture, but a data
warehouse is not synonymous with business
intelligence.

INSEE PRESENTATION
France's National Institute of Statistics and Economic
Studies (Institut National de la Statistique et des Études
Économiques: INSEE) is a Directorate General of the
Ministry of the Economy, Finance, and Employment. It is
therefore a government agency whose personnel are
government employees, although not all belong to the
civil service. INSEE operates under government accounting
rules: it receives its funding from the State's general
budget.
Getting to know INSEE
Main goal and missions, legislative framework, INSEE in
the European statistical system, brief history, INSEE
resources, working at INSEE.
Official Statistics
The official statistical system collects the data needed to
compile quantitative results. In this capacity, it undertakes

censuses and surveys, manages databases, and also draws


on administrative sources.
Quality at the INSEE
This quality rubric describes the rules, methods and
resources that enable official statistics to meet quality
requirements as well as possible. Such a description draws
direct inspiration from the fifteen principles and related
indicators from the European Statistics Code of Practice.
French, European and International statistical sites
Statistics production is conducted under a program, which
is a "decision" applicable to the Member States. INSEE
helps to design and implement multilateral cooperation
programs under the aegis of international organizations
such as Eurostat, U.N. institutions, the World Bank, and
the International Monetary Fund (IMF).
Seminars, conferences and fairs
Conferences and seminars organized by Insee or in which
Insee has participated.

PHASE 1:
ACHIEVEMENT CONTEXT

PROJECT GOALS & OBJECTIVES


The first thing we had to do was to define the objectives and goals
of this BI project, because it is practically impossible to create or
accomplish a valid project without a solid understanding of its
scope. The main objectives of this BI project are:

To transform data into meaningful information to support effective
decisions, by improving its quality, consistency and completeness.

To build a data warehouse based on INSEE results (the distribution of
employees, the repartition of student loans, population statistics,
rates of death, birth and marriage ...) to set the stage for
successful and effective data mining.

To deploy and fully exploit the data warehouse with the appropriate
tools.

To generate specific and flexible reports.

PROJECT PROGRESS
As the first and foremost important step in our BI project, we
strategically started by identifying the project perimeter: we
analyzed and tried to understand all the data in the given
spreadsheets, then we set the goals and the ultimate objectives,
relying on every specified note.
After identifying the project context and boundary, we cleaned,
scrubbed and filtered the data using the appropriate ETL tools. ETL,
which stands for EXTRACT, TRANSFORM AND LOAD, is the set of functions
combined into one solution that enables extracting data from numerous
databases, sources, applications and systems, transforming it as
appropriate, and loading it into another database, a data mart or a
data warehouse for analysis, or sending it along to another
operational system to support a business process. Creating a Data
Warehouse was the next phase: we tried to keep in mind that a DW is
most likely to succeed if it is highly organized and flexible.

Then, we exploited and analyzed all the Data Marts using the options
offered by Cognos; we also generated several reports and adjusted the
value formats.
Subsequent to this stage was the Data Mining phase, devoted to
drawing benefits from the collected data in order to improve business
by predicting and understanding behaviors. Finally, as BI is meant to
respond to all types of issues, we inferred in this last phase
descriptive or explanatory models, and we construed and interpreted
all the results.

PHASE 2:
DATA SOURCING & ETL

DATA UNDERSTANDING

After setting up our BI project's perimeter and goals, we proceeded
with a very central step, which is data understanding. There are
several things to be learned about the data even after creating the
Data Warehouse or mining it, such as identifying entities and the
meanings of individual attributes.

Fortunately, we did not have to collect data - a really crucial
phase, chiefly when it comes to several sources - since the Excel
spreadsheets given were largely enough. We did, however, have our
share of problems, mainly related to data comprehension, since some
information was missing (mostly the DOM data) and other information
was misplaced, for example townships and township fractions; we had
to grasp the confusing or ambiguous combinations, and it took us a
long time to seize them.

DATA CLEANING
Data surveying is not an obligatory step, but it is useful in many
respects. Its main role at this stage is to find out, from the
general structure of the data, whether or not there is a useful
amount of information enfolded in the extracted or given data, which
leads us to the data cleaning phase. Basic as it is, its purpose is
to get healthy data that can improve the final modeling results. This
included checking the consistency of individual attribute values and
types, checking quantity, removing redundancy and finding outliers:
we did detect a few anomalies regarding the slight difference between
the number of recruits and the number of candidates accepted in
internal contests, especially when there is no free intake test
(concours HF file).
Checking in this phase deals with the completeness and correctness of
the data. Completeness defines the proportion and regularity of
missing values in the data. Correctness is related to the discovery
of erroneous values present in the data, their extent and possible
remedies.
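
To make the completeness check concrete, here is a minimal Java
sketch of the kind of per-column missing-value count we performed by
hand; the file name, the semicolon delimiter and the column layout
are assumptions for illustration, not the project's actual files.

    import java.io.BufferedReader;
    import java.io.FileReader;

    public class CompletenessCheck {
        public static void main(String[] args) throws Exception {
            // Hypothetical semicolon-separated export of one INSEE spreadsheet.
            try (BufferedReader in = new BufferedReader(new FileReader("concours_hf.csv"))) {
                String[] header = in.readLine().split(";", -1);
                int[] missing = new int[header.length];
                int rows = 0;
                String line;
                while ((line = in.readLine()) != null) {
                    String[] cells = line.split(";", -1);
                    for (int i = 0; i < header.length; i++) {
                        // A cell counts as "missing" if it is absent or blank.
                        if (i >= cells.length || cells[i].trim().isEmpty()) {
                            missing[i]++;
                        }
                    }
                    rows++;
                }
                // Report the proportion of missing values per attribute.
                for (int i = 0; i < header.length; i++) {
                    System.out.printf("%-30s %5.1f%% missing%n",
                            header[i], 100.0 * missing[i] / Math.max(rows, 1));
                }
            }
        }
    }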

DATA ACQUISITION & ETL PROCESS

It can become very difficult to extract the desired data, and it is
easy to implement something that either misses the users'
expectations or only partially satisfies them. Data acquisition, or
the extract, transform and load (ETL) process, is a complex set of
activities whose sole principle is to attain the most accurate and
integrated data possible and make it accessible to the enterprise
through the data warehouse.

It includes the following subprocesses:

Extracting, which stands for copying the parts that we needed from
INSEE's Excel spreadsheets to the data staging area for further work,
and purging the data that will not be used.

Transforming: Once the data was extracted into the data staging area,
we applied as many transformations as we could, including correcting
misspellings, parsing the data into standard formats (like the PIB,
which we had to convert from millions of euros to euros), and
changing data into the appropriate type: the major problem with the
given data is that all attributes and values were text, which is
really senseless, since dates and numeric values are involved. We
also had to combine the sources, by matching and aggregating
information that has the same context, or even the same structure.

Loading: At the end of the transformation process, we were able to
load the data into CSV files, so that it could easily be imported
into the database that would be created.
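
As an illustration of the transformation step, the hypothetical Java
sketch below parses one exported spreadsheet, turns the text year
into a numeric value and converts the PIB from millions of euros into
euros before writing a clean CSV; the file names, delimiter and
column order are assumptions, not the actual project files.

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.PrintWriter;
    import java.math.BigDecimal;

    public class TransformPib {
        public static void main(String[] args) throws Exception {
            // Hypothetical input/output names; the real spreadsheets were
            // exported to semicolon-separated text before this step.
            try (BufferedReader in = new BufferedReader(new FileReader("pib_regions_brut.csv"));
                 PrintWriter out = new PrintWriter("pib_regions_propre.csv")) {
                String line = in.readLine();          // skip the header row
                out.println("code_region;annee;pib_euros");
                while ((line = in.readLine()) != null) {
                    String[] f = line.split(";");
                    String codeRegion = f[0].trim();
                    // Text year -> numeric year.
                    int annee = Integer.parseInt(f[1].trim());
                    // GDP arrives as text in millions of euros: convert to euros.
                    BigDecimal pibEuros = new BigDecimal(f[2].trim().replace(",", "."))
                            .multiply(BigDecimal.valueOf(1_000_000));
                    out.println(codeRegion + ";" + annee + ";" + pibEuros.toPlainString());
                }
            }
        }
    }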

Nevertheless, we applied 80% of the ETL process manually, the goal
being to have the cleanest DW possible; the remaining 20% was handled
by an ETL tool called Pentaho, which is presented in the next
chapter.

PENTAHO'S ETL TOOL

We have come a long way from the days when all such activities had to
be done manually: the BI industry has developed a plethora of tools
and technologies to support the data acquisition process. For our BI
project we have chosen Pentaho Data Integration, which offers the
first fully unified ETL, modeling and data visualization development
environment for Agile BI.
Here's a preview of the Pentaho interface while using it for data
transformation:

PHASE 3:
DESIGNING THE DATA
WAREHOUSE

DATA WAREHOUSING

Data warehouses collect relevant data from multiple different data
sources, rationalize and summarize it, and catalog it in large,
consistent, stable, accurate, long-term data stores, which not only
allows all types of questions to be answered, but also provides
insights into the data to answer the same question asked in multiple
different ways, to support the decision-making process.
Although specific vocabularies vary from organization to
organization, the data warehousing industry agrees that the data
warehouse lifecycle model is fundamentally as described in the
diagram on the next page.

The model, which is a cycle rather than a serialized timeline,
consists of five major phases:

Design: Practically speaking, the best data warehousing practitioners
are those who combine data with indicators and other critical
business metrics.

Prototype: Developing a unanimous working model of a data warehouse
or data mart design, suitable for actual use. The purpose is to allow
a back and forth between design and prototype.

Deploy: It is at this phase that the single most often neglected
component can undermine the whole process.

Operation: The day-to-day maintenance of the data warehouse or mart,
and the data delivery services provided to analysts to keep the
warehouse or mart current.

Enhancement: In cases where external conditions change
discontinuously, or business organizations themselves undergo
discontinuous changes.

CONCEPTUAL DATA MODEL

The following diagram illustrates and defines the portions


that our Data Warehouse will contain.
Part 1:

We made this portion of the conceptual model according to the
following management rules:

A contest refers to one and only one category, type and intake type,
but a category may have an assortment of contests; the same applies
to intake types and contest types.
A socio-professional category has at least one specific piece of
equipment, and conversely a piece of equipment may concern several
socio-professional categories.
Part 2:

To set up this part of our CDM, we relied on these rules:

A Superior Class belongs to one precise category, but a category
encloses many superior classes. There is an association between class
categories, gender and the BAC option, which includes a headcount and
a percentage for a precise date.
Part 3:
This is nearly the major part of our CDM, which embraces most of the
entities that we have. We should mention, however, that we did merge
some of the data because they share the same structure, such as
domiciled births and deaths.

Entities List:
Marriage
Canton
Arrondissement
Departement
Territoire
Activite
Région
Etat
Bibliothèque
Population
Commune
Categorie_concours
Naissances et Décès
Type_concours
Sous_Activité
Concours
Type_recrutement
Categorie_SP
Equipement
Sexe
Bac
Classe_sup
Branche
Categorie_classe_sup

Sample of Associations:
concern
pourcentage
Belong to
Has
A pour etablissement
Categorie_SP
s'applique
s'adresse
A pour pib_region
intéresse
A pour sous activités
appartient

Sample of data Items

PHYSICAL DATA MODEL

A Physical Data Model includes all the database entities/tables/views,
attributes/columns/fields and the relationships between the entities
that we have defined. Database performance, indexing strategy,
physical storage and denormalization are important considerations
when creating the physical data model. How the database is created
depends on all the constraints implemented in the PDM.

DATA MARTS
In order to conceive our data marts, we first had to form our
dimensions and our fact tables. We started by denormalizing the
physical implementation, so that we could put one fact in numerous
places. Foremost, this improves usability by grouping all the
associated attributes in a table, thus significantly reducing the
total number of tables that a user will face.

Our dimensions are as follows:

Activite Dimension: a merge of two tables (Activité and
Sous_activité), this dimension presents the activities and
subactivities related to each field.

BAC Dimension: a merge of two tables (bac and branche) that names all
the BAC options.

Classes Dimension: a merge of two tables (classes supérieures and
categorie_classe), which offers all the superior classes and their
categories.

Commune Dimension: a combination of quite a lot of tables (communes
associées, cantons, fractions cantonales, arrondissement, commune);
this dimension remains the geographic dimension that specifies the
territory and the ground.

Region and Departement Dimensions: these two dimensions refer to
territory too, just like the Commune dimension.

Concours Dimension: a combination of two tables (type_concours and
catégorie_concours); it presents all the contests and the categories
related to them.

Equipement, Sexe, Categorie_SP, Type_recrutement and Etat Dimensions:
they refer respectively to the equipment, gender, socio-professional
category, recruitment type, and the state of the data (whether the
GDP figure is final, semi-definitive or provisional).
However, these dimensions would not be of much use without a specific
kind of table, central to each dimensional model and containing the
most useful facts: the fact table.
Every fact table represents a many-to-many relationship, and every
fact table encloses a set of two or more foreign keys that join to
their respective dimension tables.
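
As a purely illustrative sketch of this structure, the Java/JDBC
fragment below creates one such fact table in a local MySQL schema,
with its measures and its foreign keys pointing to dimension tables
assumed to exist already; the connection URL, credentials and table
names are hypothetical, not the ones used in the project.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class CreateFactTable {
        public static void main(String[] args) throws Exception {
            // Hypothetical local MySQL connection (requires MySQL Connector/J
            // on the classpath); schema name and credentials are placeholders.
            Connection con = DriverManager.getConnection(
                    "jdbc:mysql://localhost:3306/insee_dw", "root", "");
            try (Statement st = con.createStatement()) {
                // A fact table holds measures plus foreign keys to its dimensions.
                st.executeUpdate(
                    "CREATE TABLE IF NOT EXISTS fait_pib_region ("
                  + "  id_region INT NOT NULL,"
                  + "  id_etat   INT NOT NULL,"
                  + "  annee     INT NOT NULL,"
                  + "  pib              DECIMAL(18,2),"  // measures
                  + "  pib_par_habitant DECIMAL(18,2),"
                  + "  pib_par_emploi   DECIMAL(18,2),"
                  + "  PRIMARY KEY (id_region, id_etat, annee),"
                  + "  FOREIGN KEY (id_region) REFERENCES dim_region(id_region),"
                  + "  FOREIGN KEY (id_etat)   REFERENCES dim_etat(id_etat)"
                  + ") ENGINE=InnoDB");
            }
            con.close();
        }
    }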
This is a list of all the fact tables that we've gathered and
designed:

Etablissement: this fact table represents the number of companies in
each activity, by year and district.
Etablissement details: this one represents the number of companies in
each sub-activity.
Serie_Bac: introduces the number of students and the percentage of
girls per district.
Bibliotheque: clarifies the loans and the rate of registered readers,
by region and year.
Effectif class_sup: specifies the number of students of a specific
category by gender, according of course to a school year.
Concours: presents the number of admitted, present or recruited
persons that applied to a contest, and presents the percentage of
women too.
Pourcentage: shows the percentage of the equipment used in every
socio-professional category.
Mortalities: represents mortality rates per region and year.
Marriage: introduces the number of weddings by department and year.
Nb_naissances_deces: this fact table stipulates the number of
domiciled births and deaths.
PIB_Region: defines the GDP, the GDP per person and the GDP per job
for all the regions.
PIB_Departement: sets apart the GDP, the GDP per person and the GDP
per job for all the departments.
Population: this final fact table presents the municipal population,
and the population counted separately, for all the municipalities per
year.
Withal, conceiving the data warehouse environment usually takes the
form of replicating the dimension tables and fact tables, and
sometimes presenting these tables as logical subsets, or complete
pie-wedges, of the overall model, known as data marts.

Our data warehouse thus includes three data marts, sorted by realm or
context:

The Demographic Data Mart
This data mart treats everything that is related to demography, like
weddings, mortality, and domiciled deaths and births.

The Economic Data Mart
This one refers to all the economic values and measures, such as the
GDP, the percentage of equipment used by activities, the number of
companies...

The Education Data Mart
The last data mart shows how the dimensions related to education are
managed, such as contests, types of contests, and loan rates.

THE TOOLS USED

In terms of the tools used, the choice was difficult, given the
progression of advanced information technologies. The choice was made
carefully and was as follows:

At first, we used XAMPP to create our MySQL database, because of all
the advantages that a MySQL database may offer, such as:

The consolidated view of the base.
Quick testing of the reliability, security and performance of the
tables and the queries.
The robustness and ease of use of such a database management system.

But since we used Cognos 7, we had to convert our database to an
Access one, because unfortunately Cognos does not support a MySQL
base; so eventually, we had to export it to an XML file, which gave
rise to a format problem: all the different types of data were
converted to a text type, so basically, we had to repair each field.

Even though we refurbished the database, we faced several issues,
especially when it comes to the robustness of Access and Cognos.
Concerning Access, every time we had to execute a query that uses
numerous tables and contains a lot of data, it took us a very long
time (around 25 minutes to execute just one query). Cognos, on the
other hand, wasn't as sturdy as we hoped: the generation of all the
categories and the hypercube took us almost 90 minutes, so for every
simple modification we had to wait 90 minutes to regenerate the cube;
updating or refreshing it didn't cut it.
To exploit the cube, we used the reporter mode, which is handier and
easier to handle.

PHASE 4:
OPERATING & DISSECTING
THE DM

DATA FORMAT & CALCULATED FIELDS

Once we had completed all the steps of conceiving the data warehouse,
we finally got some data loaded and had to query it; but first, we
started by converting the data into its appropriate format: GDP to a
monetary type, assigning the percentage sign, the euro sign...

We also customized more than a few fields to make them easy to
understand or interpret, above all when it comes to reporting, which
will be presented in the next chapter.

Calculated fields were a real help and relief: we didn't have to
change our queries or create new ones to obtain data. We used them
most in the data mart related to studies, where we could view the
result of a formula that uses information from other fields in the
cube.

REPORTS GENERATION

Cognos provides, among other options, the ability to create, deploy
and manage interactive, tabular or even graphical reports from
multiple data sources. We tried to generate the essential ones.

The two reports present the domiciled births and deaths in France;
the first one is general, but the second concerns only the DOM
departments for 2006 and 2007.

In these graphs, we applied personalized values: red refers to values
under 30%, green between 40% and 50%, and black above 50%.

The first two graphs of the previous page refer to the number of
weddings in 2006 and 2007, and as we can see, there is a slight
difference between the two graphs, with Île-de-France remaining the
region with the highest number. The other report presents the GDP of
all 26 regions, with Rhône-Alpes as the first region in terms of GDP.
And the last report clarifies its evolution for Metropolitan France
(2000 to 2007), and shows that the GDP didn't retreat at all.

PHASE 5:
DATA MINING PHASE

DATA MINING PRESENTATION


According to the Gartner Group, "Data mining is the process of
discovering meaningful new correlations, patterns and trends by
sifting through large amounts of data stored in repositories, using
pattern recognition technologies as well as statistical and
mathematical techniques." There are other definitions:
Data mining is the analysis of observational data sets to find
unsuspected relationships and to summarize the data in new ways.
Data mining is an interdisciplinary field bringing together
techniques from machine learning, pattern recognition, statistics,
databases, and visualization to address the issue of information
extraction from large databases.

However, we tried as hard as we could to describe, estimate, predict,
classify, cluster and associate the data that we had.

ANALYSING DATA
To analyze the data, we chose to work with the tool WEKA. The
advantage is that this tool is programmed in Java and therefore
relatively fast. Moreover, it is extremely reliable. It has all the
algorithms, classification and search functions. Besides, it contains
and offers a large range of capabilities when it comes to building
graphs.

After configuring WEKA correctly and establishing the connection, we
retrieved the data that we wanted to analyze, using the Explorer
interface of the tool.

TEST METHODS
We were interested in decision trees and the K-Means method. We
started with decision trees, applying the J48 algorithm (an improved
version of Quinlan's C4.5 algorithm).
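
For reference, a J48 run of this kind can be reproduced through the
WEKA Java API with a minimal sketch like the one below; the ARFF file
name is hypothetical, and it assumes the attribute to predict is
nominal and stored in the last column.

    import weka.classifiers.trees.J48;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class J48Example {
        public static void main(String[] args) throws Exception {
            // Hypothetical ARFF export of the births/deaths query; WEKA expects
            // the class attribute to be nominal for J48.
            Instances data = DataSource.read("naissances_deces.arff");
            data.setClassIndex(data.numAttributes() - 1);  // class = last attribute

            J48 tree = new J48();        // WEKA's implementation of C4.5
            tree.setUnpruned(false);     // keep the default pruned tree
            tree.buildClassifier(data);

            // Prints the tree in the same textual form the Explorer displays.
            System.out.println(tree);
        }
    }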

Decision trees:
The figure below is a decision tree listing similar departments in
terms of births and deaths.

The second example concerns the decision tree classification of GDP
by department, according to their values.

B- K-Means

In statistics and machine learning, k-means clustering is a method of
cluster analysis which aims to partition n observations into k
clusters in which each observation belongs to the cluster with the
nearest mean. It is similar to the expectation-maximization algorithm
for mixtures of Gaussians in that they both attempt to find the
centers of natural clusters in the data, as well as in the iterative
refinement approach employed by both algorithms.
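
The runs reported below (scheme weka.clusterers.SimpleKMeans -N k
-I 500 -S 10) can be reproduced programmatically with a minimal Java
sketch like the following; the input file name is hypothetical, and
only the number of clusters changes between runs.

    import weka.clusterers.ClusterEvaluation;
    import weka.clusterers.SimpleKMeans;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class GdpClustering {
        public static void main(String[] args) throws Exception {
            // Hypothetical export of the GDP-by-department query (CSV or ARFF).
            Instances data = DataSource.read("pib_par_departement.csv");

            SimpleKMeans km = new SimpleKMeans();
            km.setNumClusters(2);        // -N 2  (3, 6, 7, 10 in the other runs)
            km.setMaxIterations(500);    // -I 500
            km.setSeed(10);              // -S 10
            km.buildClusterer(data);

            // Same summary the Explorer prints: centroids, iterations, SSE.
            System.out.println(km);

            ClusterEvaluation eval = new ClusterEvaluation();
            eval.setClusterer(km);
            eval.evaluateClusterer(data);
            System.out.println(eval.clusterResultsToString());
        }
    }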

We took advantage of this algorithm to test our data for GDP and the
number of marriages in the departments.
The results are:
GDP (Gross domestic product)
K=2

=== Run information ===

Scheme:       weka.clusterers.SimpleKMeans -N 2 -A "weka.core.EuclideanDistance -R first-last" -I 500 -S 10
Relation:     QueryResult
Instances:    96
Attributes:   2
              pib
              nom_departement
Test mode:    evaluate on training data

=== Model and evaluation on training set ===

kMeans
======
Number of iterations: 7
Within cluster sum of squared errors: 94.93524745601803
Missing values globally replaced with mean/mode

Cluster centroids:
Cluster      (size)   pib                 nom_departement
Full Data    (96)     17670552083.3333    Ain
0            (78)     10561371794.8718    Aisne
1            (18)     48477000000         Ain

Clustered Instances
0    78 ( 81%)
1    18 ( 19%)

K=3

kMeans
======
Number of iterations: 14
Within cluster sum of squared errors: 93.52052880969502
Missing values globally replaced with mean/mode

Cluster centroids:
Cluster      (size)   pib                 nom_departement
Full Data    (96)     17670552083.3333    Ain
0            (72)     9343222222.2222     Aisne
1            (21)     34178333333.3333    Ain
2            (3)      101972000000        Alpes-Maritimes

Clustered Instances
0    72 ( 75%)
1    21 ( 22%)
2     3 (  3%)


K=6

kMeans
======
Number of iterations: 11
Within cluster sum of squared errors: 90.27658195695966
Missing values globally replaced with mean/mode

Cluster centroids:
Cluster      (size)   pib                 nom_departement
Full Data    (96)     17670552083.3333    Ain
0            (25)     7723840000          Aisne
1            (27)     14707333333.3333    Ain
2            (7)      43717571428.5714    Alpes-Maritimes
3            (20)     3708000000          Alpes-de-Haute-Provence
4            (3)      110004333333.3333   Bouches-du-Rhône
5            (14)     28284500000         Finistère

Clustered Instances
0    25 ( 26%)
1    27 ( 28%)
2     7 (  7%)
3    20 ( 21%)
4     3 (  3%)
5    14 ( 15%)

K=7

kMeans
======
Number of iterations: 12
Within cluster sum of squared errors: 89.27102393425096
Missing values globally replaced with mean/mode

Cluster centroids:
Cluster      (size)   pib                 nom_departement
Full Data    (96)     17670552083.3333    Ain
0            (23)     7465304347.8261     Allier
1            (18)     12792333333.3333    Ain
2            (7)      43717571428.5714    Alpes-Maritimes
3            (20)     3708000000          Alpes-de-Haute-Provence
4            (3)      110004333333.3333   Bouches-du-Rhône
5            (11)     29749727272.7273    Finistère
6            (14)     18354714285.7143    Calvados

Clustered Instances
0    23 ( 24%)
1    18 ( 19%)
2     7 (  7%)
3    20 ( 21%)
4     3 (  3%)
5    11 ( 11%)
6    14 ( 15%)

And K=10

kMeans
======
Number of iterations: 10
Within cluster sum of squared errors: 86.26315726326193
Missing values globally replaced with mean/mode

Cluster centroids:
Cluster      (size)   pib                 nom_departement
Full Data    (96)     17670552083.3333    Ain
0            (16)     6707000000          Allier
1            (8)      13605250000         Ain
2            (7)      43717571428.5714    Alpes-Maritimes
3            (19)     3614684210.5263     Alpes-de-Haute-Provence
4            (3)      110004333333.3333   Bouches-du-Rhône
5            (9)      31099777777.7778    Haute-Garonne
6            (12)     16968916666.6667    Aisne
7            (9)      11780333333.3333    Calvados
8            (5)      23217000000         Finistère
9            (8)      8733875000          Charente

Clustered Instances
0    16 ( 17%)
1     8 (  8%)
2     7 (  7%)
3    19 ( 20%)
4     3 (  3%)
5     9 (  9%)
6    12 ( 13%)
7     9 (  9%)
8     5 (  5%)
9     8 (  8%)

Marriages

K=2

kMeans
======
Number of iterations: 6
Within cluster sum of squared errors: 204.48640349553273
Missing values globally replaced with mean/mode

Cluster centroids:
Cluster      (size)   nbrdecesdomicillie   nbrnaissancesvivantedomicillie   nbrmariages   nom_departement
Full Data    (200)    5273.02              8232.795                         2738.765      Val-de-Marne
0            (150)    3728.2133            4988.06                          1862.4467     Val-d'Oise
1            (50)     9907.44              17967                            5367.72       Val-d'Oise

Clustered Instances
0    150 ( 75%)
1     50 ( 25%)

K=4

kMeans
======
Number of iterations: 12
Within cluster sum of squared errors: 196.85641391574097
Missing values globally replaced with mean/mode

Cluster centroids:
Cluster      (size)   nbrdecesdomicillie   nbrnaissancesvivantedomicillie   nbrmariages   nom_departement
Full Data    (200)    5273.02              8232.795                         2738.765      Val-d'Oise
0            (38)     8254.8421            13968.2105                       5206.1579     Val-d'Oise
1            (16)     12836.3125           26404.75                         5695.125      Val-de-Marne
2            (71)     4866.5493            6736.1127                        2441.3239     Marne
3            (75)     2533.52              2867.0267                        1139.5067     Haute-Marne

Clustered Instances
0    38 ( 19%)
1    16 (  8%)
2    71 ( 36%)
3    75 ( 38%)

.
.
.

K=10

kMeans
======
Number of iterations: 4
Within cluster sum of squared errors: 183.90874037493157
Missing values globally replaced with mean/mode

Cluster centroids:
Cluster      (size)   nbrdecesdomicillie   nbrnaissancesvivantedomicillie   nbrmariages   nom_departement
Full Data    (200)    5273.02              8232.795                         2738.765      Val-d'Oise
0            (16)     6911.3125            10702.6875                       2008.9375     Meuse
1            (10)     15436.7              27517.5                          5640.6        Paris
2            (8)      4210.75              5244.625                         5096.875      Oise
3            (41)     5234.1951            6986.2195                        2694.6341     Aisne
4            (12)     3065.8333            2728.4167                        985.8333      Indre
5            (34)     8589.7059            16988.5294                       5501.7353     Val-d'Oise
6            (4)      1470                 4585.75                          813           Tarn-et-Garonne
7            (16)     2498.6875            3052.625                         1220.125      Aube
8            (22)     1555.8636            1468                             715.9091      Haute-Marne
9            (37)     3579.4595            4376.1351                        1906.3784     Marne

Clustered Instances
0    16 (  8%)
1    10 (  5%)
2     8 (  4%)
3    41 ( 21%)
4    12 (  6%)
5    34 ( 17%)
6     4 (  2%)
7    16 (  8%)
8    22 ( 11%)
9    37 ( 19%)

CONCLUSION

This project certainly gave us a lot of trouble: some problems were
encountered while conceiving the Data Warehouse and analyzing it, but
these problems were overcome, mainly thanks to the support and
assistance of the members of the team.

However, this project allowed us to highlight the fact that teamwork
is the cornerstone of every endeavor.

Finally, we greatly appreciate the opportunity that was given to us,
since we could address issues of knowledge, skills, adaptability,
context and values.
