Submitted by
Under supervision of
TABLE OF CONTENTS
1. Abstract
2. Objective
3. Design Principles
4. Implementation
5. Project Schedule
6. Work Division
7. Future Scope
8. Learning Experience
9. References
Abstract:
The dataset consists of the locations of U.S. farmers markets and the goods available at each market by season. We have created a data mart that provides this information and answers questions designed for two types of users: consumers and government officials. For the data mining project, we work on the same data to find patterns.
OBJECTIVE:
Background Information:
In the data mining project we mine US Farmers Market data to extract knowledge, using the WEKA tool.
The data source is http://catalog.data.gov/dataset/farmers-markets-geographic-data.
The original dataset consists of 8000 records with 41 different attributes related to farmers markets. Our primary goal is to use different mining tools to apply classification and clustering algorithms.
Design Principles:
The design of this project centers on data cleaning and preprocessing. The first phase cleans the data and makes it compatible with the data mining tool; the next phase applies data mining algorithms to obtain classification and clustering results and studies those algorithms.
The data was cleaned and pre-processed manually by checking all the attribute entries and making changes in Microsoft Office Excel.
Using the WEKA data mining tool, and based on the structure and type of the database, we applied the following algorithms:
1. Classification Algorithms:
a. Logistic Algorithm
b. J48 (Decision Tree)
2. Clustering Algorithms:
a. Expectation Maximization (EM) Algorithm
b. K-Means Algorithm
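As a concrete illustration of the clustering side, the k-means idea can be sketched in pure Python. The market records below are hypothetical 0/1 goods-availability flags, not our actual WEKA run:

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Minimal k-means: assign each point to its nearest centroid,
    then recompute each centroid as the coordinate-wise mean."""
    random.seed(seed)
    centroids = random.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # squared Euclidean distance to each centroid
            d = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[d.index(min(d))].append(p)
        centroids = [
            tuple(sum(col) / len(cl) for col in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
    return centroids, clusters

# Hypothetical market rows: (Vegetables, Bakedgoods, Wine) as 0/1 flags.
markets = [(1, 1, 0), (1, 1, 1), (1, 0, 1), (0, 0, 0), (0, 1, 0), (0, 0, 1)]
cents, groups = kmeans(markets, k=2)
```

WEKA's SimpleKMeans follows the same assign/recompute loop, with extra handling for nominal attributes and missing values.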
Implementation:
To mine the data we followed the KDD process. The steps were:
1. Data preprocessing:
As this is real-world data, it is noisy and needs preprocessing.
To make it easier to handle, we trimmed the original data to 1907 rows and use 35 of the 41 attributes.
The Season attribute was not consistent throughout the data; in some records it was given as a date or as a duration in months. To make it consistent we added two columns, Season start and Season end.
Some records contained special characters that WEKA does not accept, so we removed them or replaced them with appropriate ones.
2. Import the preprocessed data into WEKA.
3. Apply classification and clustering algorithms as described below.
Given the structure of the dataset and the type of database, only specific algorithms yield results that interpret the data well.
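The season normalization and special-character cleanup from step 1 can be sketched in Python. The value formats and characters below are assumptions about the raw file, not an exact transcript of our manual Excel edits:

```python
import re

MONTHS = ["January", "February", "March", "April", "May", "June",
          "July", "August", "September", "October", "November", "December"]

def split_season(raw):
    """Split a free-text Season value such as 'May to October' or
    '06/01/2013 to 10/31/2013' into (season_start, season_end) months.
    The two formats handled here are assumptions about the raw data."""
    raw = raw.strip()
    # Month-name ranges: 'May to October'
    m = re.match(r"([A-Za-z]+)\s+to\s+([A-Za-z]+)$", raw)
    if m and m.group(1) in MONTHS and m.group(2) in MONTHS:
        return m.group(1), m.group(2)
    # Date ranges: '06/01/2013 to 10/31/2013' -> month names
    m = re.match(r"(\d{2})/\d{2}/\d{4}\s+to\s+(\d{2})/\d{2}/\d{4}$", raw)
    if m:
        return MONTHS[int(m.group(1)) - 1], MONTHS[int(m.group(2)) - 1]
    return "", ""

def clean_field(value):
    """Drop characters the WEKA loader chokes on (quotes, percent signs)."""
    return re.sub(r"['\"%]", "", value)
```

Unrecognized season formats fall through to empty strings rather than guessing, mirroring how we left genuinely ambiguous records for manual review.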
Classification Algorithms:
We used the same database for the data mining project and the data warehousing project. The database is very large and distributed, with many independent and few dependent attributes. After analyzing it, we concluded that applying different data mining algorithms to different sets of attributes from the database interprets the data best.
The two broad attribute sets formed for the data mining project are:
1. Goods Prediction and Clustering:
Location + Season Information + Goods Available
Basic Classification Histogram
In the histogram above we can select any good as the class and visualize the distribution of that good across all states or seasons.
Red: the selected good is available.
Blue: the selected good is not available.
2. Nutrition Program Prediction and Clustering:
Location + Season Information + Nutrition Programs
For nutrition programs we find which program is available at which market location and during which season.
From the diagram above we interpret that the Logistic classification algorithm can predict the next/future instance of 'Wine' with 88.8% accuracy, given the dependent relations among all the attributes used in this example (location + season + all goods).
Similarly, for the nutrition programs we use the 'location + season + nutrition program' subset of the data. In the following example the algorithm predicts a future instance of SFNMP with 83.4% accuracy.
Logistic Algorithm for class SFNMP
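The idea behind the Logistic classifier can be sketched as a plain gradient-descent logistic regression in pure Python. This is not WEKA's actual implementation (WEKA uses a ridge-regularized multinomial fit), and the rows below are hypothetical:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(X, y, lr=0.5, epochs=2000):
    """Stochastic gradient descent on the logistic loss: a sketch of
    the idea behind WEKA's Logistic classifier, not its actual code."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b)
            err = p - yi          # gradient of the log-loss w.r.t. z
            w = [wj - lr * err * xj for wj, xj in zip(w, xi)]
            b -= lr * err
    return w, b

def predict(w, b, x):
    return 1 if sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b) >= 0.5 else 0

# Hypothetical rows: (Vegetables, Cheese) flags -> Wine available?
X = [(1, 1), (1, 0), (0, 1), (0, 0)]
y = [1, 1, 0, 0]   # in this toy data Wine simply follows Vegetables
w, b = train_logistic(X, y)
```

On real attributes the learned weights indicate which goods most strongly co-occur with the class, which is how the 88.8% 'Wine' prediction above should be read.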
From the diagram above, 'Bake-goods' can be predicted with 94% accuracy using the attribute 'Vegetables', which J48 determined to be the most relevant.
Decision Tree for Bake-goods
The attribute 'Vegetables' is not used alone to predict 'Bake-goods'; other relevant attributes such as 'Prepared' and 'Soap' are used as well.
Rules that can be formed from the above decision tree are:
1. If Vegetables=Yes then Bake-goods=Yes
2. If Vegetables=No And Prepared=Yes then Bake-goods=Yes
3. If Vegetables=No And Prepared=No And Soap=Yes then Bake-goods=Yes
4. If Vegetables=No And Prepared=No And Soap=No then Bake-goods=No
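The four rules above are directly executable; writing them as a small Python predicate makes the tree's logic explicit (argument names are shortened attribute names):

```python
def predict_bake_goods(vegetables, prepared, soap):
    """The J48 rules above as a predicate.
    Each argument is True/False for availability at a market."""
    if vegetables:
        return True          # Rule 1
    if prepared:
        return True          # Rule 2
    if soap:
        return True          # Rule 3
    return False             # Rule 4
```

Note the tree collapses to a simple OR of the three attributes, which is why a single strong attribute like 'Vegetables' already carries most of the 94% accuracy.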
Star Schema:
The star schema is the simplest style of data mart schema and is the approach most widely used to develop data warehouses and dimensional data marts. It consists of one or more fact tables referencing any number of dimension tables. The star schema is an important special case of the snowflake schema and is more effective for handling simpler queries.
The star schema gets its name from the physical model's resemblance to a star shape, with a fact table at its center and the dimension tables surrounding it representing the star's points. The star schema separates business process data into facts, which hold the measurable, quantitative data about a business, and dimensions, which are descriptive attributes related to the fact data. Examples of fact data include sales price, sale quantity, and time, distance, speed, and weight measurements. Related dimension attribute examples include product models, product colors, product sizes, geographic locations, and salesperson names.
A star schema that has many dimensions is sometimes called a centipede schema. Having dimensions of only a few attributes, while simpler to maintain, results in queries with many table joins and makes the star schema harder to use.
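A star schema for our farmers-market mart can be illustrated in miniature: one fact table referencing two dimension tables by surrogate key. The table and column names below are made up for the example:

```python
# Dimension tables: descriptive attributes, keyed by surrogate id.
dim_market = {1: {"name": "Downtown Market", "state": "NY"},
              2: {"name": "Riverside Market", "state": "CA"}}
dim_season = {1: {"start": "May", "end": "October"},
              2: {"start": "June", "end": "September"}}

# Fact table: one row per market/season with a measurable quantity.
fact_availability = [
    {"market_id": 1, "season_id": 1, "goods_offered": 12},
    {"market_id": 2, "season_id": 2, "goods_offered": 8},
]

def goods_by_state():
    """Join fact rows to the market dimension and sum a measure per state,
    the in-memory equivalent of a star-join GROUP BY query."""
    totals = {}
    for row in fact_availability:
        state = dim_market[row["market_id"]]["state"]
        totals[state] = totals.get(state, 0) + row["goods_offered"]
    return totals
```

Every consumer or government-official question in our data mart reduces to this pattern: pick a measure from the fact table and roll it up along one or more dimensions.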
Project Schedule
January-
Learning about data mining and its capabilities
February-
Defining the problem statement and understanding the application domain
March-
Gathering, preparing and exploring dataset.
Choosing the suitable data mining methods.
April-
Building and deploying data analysis model
May-
Consolidating the results and report generation
Work Division
Pawan:
Formulating the problem hypothesis.
Gathering and preprocessing the data.
Irfan:
Building and deploying the model for data mining.
Kamal:
Future scope
A privileged user can insert new records in the future.
Integrate Google Maps for locations and directions.
Develop a mobile application.
Apply UI validations and a filtering option on the data.
Learning experience
Learned about analytical processing.
Learned different data mining tools such as WEKA and RapidMiner.
Learned about real-world applications of different data mining algorithms.
References:
http://catalog.data.gov/dataset/farmers-markets-geographic-data
The estimates of the productivity and production of various crops were made based on the crop cutting experiment carried out with the joint efforts of the Departments of Economics and Statistics, Agriculture, and Horticulture and Plantation Crops. Retrieved from http://agritech.tnau.ac.in/pdf/2012/Season%20&%20Crop%20Report%202012.pdf
Cunningham, S. J. and G. Holmes, 1999. Developing Innovative Applications in Agriculture Using Data Mining. Department of Computer Science, University of Waikato, Hamilton, New Zealand.