
1. Data mining - regression problems.

Estimating the value of the property. The possibility of using STATISTICA in neural
modelling.

Acquaintance with the dataset "CALIFORNIA"

The dataset comes from the census carried out in California in 1990. The entire state
was divided into more than 20,000 census areas. After removing records with missing
data, 20640 observations remain, each relating to one area.
Each observation contains information on:
A. the median age of real estate in the area,
B. the total number of rooms,
C. the total number of bedrooms,
D. the size of the population in the area,
E. the number of households in the area,
F. the median household income,
G. the median price of residential real estate.
Variable G is treated as the dependent (output) variable, i.e. the variable whose value
will be estimated by the constructed model. Variables A-F are the independent (input)
variables: they describe the characteristics of the individual areas that affect the
price of the property (the value of G).
To construct (learning with simultaneous validation) and test the neural network, we
will use the set Kalifornia1.xls, which contains 20630 observations. After selecting the
most effective model (network) and optimizing its parameters, we will apply this model
to estimate the value of variable G for the 10 cases available in the set
Kalifornia2.xls.
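
The division of the 20630 cases into learning, validation and test subsets can be pictured with the short Python sketch below. It is only an illustration, not the procedure STATISTICA performs internally: the function name split_cases, the 70/15/15 proportions and the fixed seed are assumptions made for this example.

```python
import random


def split_cases(n_cases, learn_pct=70, valid_pct=15, seed=0):
    """Shuffle case indices and split them into learning, validation and test subsets.

    The percentages are illustrative assumptions; the test subset takes the remainder.
    """
    indices = list(range(n_cases))
    random.Random(seed).shuffle(indices)  # fixed seed keeps the split reproducible

    n_learn = n_cases * learn_pct // 100
    n_valid = n_cases * valid_pct // 100

    learn = indices[:n_learn]
    valid = indices[n_learn:n_learn + n_valid]
    test = indices[n_learn + n_valid:]
    return learn, valid, test


# 20630 is the number of observations stated for Kalifornia1.xls.
learn, valid, test = split_cases(20630)
print(len(learn), len(valid), len(test))
```

The test subset is never used during learning, so its error gives an estimate of how well the network generalizes to unseen cases.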

Defining the problem - constructing the model for estimation of the
property value
1. Run the program STATISTICA. Open the file Kalifornia1.xls.
2. Go to the module "Automatic neural networks".
3. On the basis of the data available in the set Kalifornia1.xls, construct the most
efficient model for estimating the value of the property (estimating the value of
variable G, i.e. the median value of real estate in a particular area of California).
Use both perceptron networks and RBF networks.
4. Carry out additional optimization of the network structure and of the parameters of
the learning algorithm (also use the "User Project" option). Note the results of the
experiments concerning this selection.
5. Using the most effective neural network (in terms of network quality on the set used
for the final evaluation of effectiveness and of the ability to generalize), determine
the values of the dependent variable (G) for the 10 cases included in the set
Kalifornia2.xls.


2. Data mining - regression problems: the project.
Forecasting the volume of electricity sales.

Construct an effective model for forecasting daily sales of electricity. The data used
for the construction and evaluation of the model (learning, validation, testing) are
included in the file energ.xls.
Next, the constructed optimal neural model is to be used to generate predictions of daily
sales (consumption) of electricity for the three cases (days) specified in the set
prognoza.xls.
Prepare project documentation containing the following (numbered) parts:
1. Understanding the business conditions (stage I of CRISP-DM).
2. Understanding the data (stage II).
3. Selection of input data, data preparation.
4. Modelling. Description (in points) of the conducted research concerning optimization
of the model parameters, along with the obtained results of learning and testing
(validation), i.e. the RMSE and MAE errors (present the numerical results in tabular
form!).
5. Evaluation of the model (stage V of CRISP-DM), possible return to previous steps and
repetition of the modelling process.
6. The finally adopted structure of the model, together with the adopted parameters and
the obtained values of the effectiveness measures (RMSE and MAE errors).
7. Own comments and conclusions about the effectiveness of the chosen model.
8. Numerical forecasts for the 3 cases (days) from the file prognoza.xls.
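
The two error measures required in the documentation can be computed as in the following Python sketch. It uses the standard definitions (MAE as the mean of absolute errors, RMSE as the square root of the mean squared error); the function names and the sample numbers in the usage line are made up for illustration and are not taken from energ.xls or prognoza.xls.

```python
import math


def mae(actual, predicted):
    """Mean absolute error: average of |actual - predicted|."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root mean squared error: square root of the average squared error."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_sq = sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
    return math.sqrt(mean_sq)


# Hypothetical values for three days, only to show the calls.
actual_sales = [100.0, 120.0, 90.0]
forecasts = [98.0, 125.0, 93.0]
print(mae(actual_sales, forecasts), rmse(actual_sales, forecasts))
```

RMSE penalizes large individual errors more strongly than MAE, so reporting both shows whether a model's error comes from a few bad days or is spread evenly.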

Please pay special attention to the correctness of the model construction methodology
(implementation of the various stages of CRISP-DM) and to the corresponding range of
tools used and analyses carried out.
One project may be carried out by 2 persons.
Rating: 0 - 20 points (for percentages, see the document "Conditions of completing the
course" on Moodle), of which:
- 7 points - the substantive value of the project,
- 7 points - the effectiveness of the constructed model, expressed as the accuracy of the
3 forecasts for the file prognoza.xls in terms of the MAE error,
- 6 points - the formal side of the project documentation.
Project documentation must be submitted in electronic form (.doc or .pdf format) using
the tools available on the course (Moodle) within the prescribed period. For each day of
delay, 2 points will be deducted.

Literature:
Inteligentne systemy w zarządzaniu. Teoria i praktyka (in Polish), editor Jerzy S.
Zieliński, PWN, W-wa, 2000, chapters 5.1, 5.2, 5.3, 5.6.1.2.
Larose D.T., Discovering Knowledge in Data. An Introduction to Data Mining, Wiley,
2005.
