
A Synopsis for the

Data Mining in Farmers Market Data

B.Tech PBL Report

Submitted by

Under supervision of

SCHOOL OF COMPUTING SCIENCE AND ENGINEERING


GALGOTIAS UNIVERSITY
GREATER NOIDA, UTTAR PRADESH, INDIA
APRIL 2017

TABLE OF CONTENTS

1. ABSTRACT

2. OBJECTIVE

3. BACKGROUND INFORMATION

4. DESIGN PRINCIPLES

5. IMPLEMENTATION ISSUES AND SOLUTIONS

6. SUMMARY OF LEARNING EXPERIENCE

7. FUTURE SCOPE

8. REFERENCES
Abstract:

The data set consists of the locations of U.S. farmers markets and the goods available at each market per season. We have created a data mart that provides this information and answers questions; the questions are designed to address two types of users, consumers and government officials. For the data mining project, we work on the same data to find patterns.

OBJECTIVE:

 Mine the available data to extract knowledge.

 Explore different data mining tools.

 Apply different data mining algorithms to the US Farmers Market data.

Background Information:
In this data mining project we mine US Farmers Market data to extract knowledge, using the WEKA tool.
The data source is http://catalog.data.gov/dataset/farmers-markets-geographic-data.
The original dataset consists of 8000 records with 41 attributes related to farmers markets. Our primary goal is to use different mining tools to apply classification and clustering algorithms.

Design Principles:
The design principles of this project include data cleaning and preprocessing.
The first phase of the project is to clean the data and make it compatible with the data mining tool; the next phase is to apply data mining algorithms to obtain classification and clustering results and to study those algorithms.
The data was cleaned and pre-processed manually by checking all the attribute entries and making changes in Microsoft Office Excel.
Using the WEKA data mining tool, and based on the structure and type of the database, we applied the following algorithms:

1. Classification Algorithms:
a. Logistic Algorithm
b. J48 (Decision Tree)

2. Clustering Algorithms:
a. Expectation Maximization (EM) Algorithm
b. K-Means Algorithm
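
Since the clustering runs are not walked through later in this report, here is a minimal sketch of how both clusterers can be applied through Weka's Java API. The input file name and the cluster count for k-means are assumptions, not values taken from the project.

import weka.clusterers.EM;
import weka.clusterers.SimpleKMeans;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class ClusteringDemo {
    public static void main(String[] args) throws Exception {
        // Load the preprocessed dataset (hypothetical file name).
        Instances data = DataSource.read("farmers_markets_clean.arff");

        // k-means with an assumed k of 4; the report does not state k.
        SimpleKMeans kMeans = new SimpleKMeans();
        kMeans.setNumClusters(4);
        kMeans.buildClusterer(data);
        System.out.println(kMeans);

        // EM selects the number of clusters itself (via cross-validation)
        // when none is specified.
        EM em = new EM();
        em.buildClusterer(data);
        System.out.println(em);
    }
}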

Tools Used: Weka (primary), RapidMiner (also explored)


RapidMiner Features:
 RapidMiner is one of the most widely used analytics platforms in the world, with over 250,000 users.
 Organizations of all sizes use RapidMiner, and its range of application is very broad.
 Many predictive models can be built without resorting to program code.
 It is a sophisticated offering with over 1,500 drag-and-drop operators.
 Novice users can quickly get up to speed with RapidMiner's 'Wisdom of Crowds' online repository of best practices.
 Big data is well accommodated through its Hadoop platform, insulating users from the complexities and volatility of big data technologies.

Current Applications and Implementations of the Tool:

RapidMiner is used in every conceivable industry, from cement manufacturing through to electronic payment companies. This range of applications demonstrates the versatility of the platform, and both medium and large businesses benefit from its low cost of entry and sophisticated capabilities. One of the most high-profile users of RapidMiner is PayPal, which needed to get an inside track on customer churn. The text analytics capabilities of RapidMiner were used to classify customers as 'top promoters' and 'top detractors' by analyzing feedback. This in turn enabled product managers to take action when negative sentiment was detected. A good example of the result of this analytics work was the identification of problems associated with passwords: changes were made, and a corresponding drop in negative comments was the outcome.

Implementation:

To mine the data we followed the KDD process. These are the steps we followed (a minimal code sketch of the preprocessing appears after this list):
1. Data Preprocessing:
 As it is real-world data, it is noisy and needs preprocessing.
 To make it easy to handle, we trimmed the original data to 1907 rows.
 We use 35 attributes out of 41.
 The Season attribute was not consistent throughout the data; in some records it was given as a date or as a duration in months. To make it consistent we added two columns named Season Start and Season End.
 Some special characters used in the data are not accepted by Weka, so we removed these characters or replaced them with appropriate ones.
2. Import the preprocessed data into Weka.
3. Apply the classification and clustering algorithms mentioned below.
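
The report notes that the cleaning was done manually in Excel; the following is only an illustrative sketch of how the same trimming could be scripted with Weka's converters and filters. The file names and attribute indices are assumptions.

import java.io.File;
import weka.core.Instances;
import weka.core.converters.ArffSaver;
import weka.core.converters.CSVLoader;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Remove;

public class PreprocessDemo {
    public static void main(String[] args) throws Exception {
        // Load the raw CSV downloaded from data.gov (hypothetical file name).
        CSVLoader loader = new CSVLoader();
        loader.setSource(new File("farmers_markets_raw.csv"));
        Instances raw = loader.getDataSet();

        // Drop the 6 unused attributes. These indices are placeholders;
        // the report does not say which 6 of the 41 attributes were dropped.
        Remove remove = new Remove();
        remove.setAttributeIndices("1,2,3,4,5,6");
        remove.setInputFormat(raw);
        Instances trimmed = Filter.useFilter(raw, remove);

        // Save as ARFF, the format Weka works with natively.
        ArffSaver saver = new ArffSaver();
        saver.setInstances(trimmed);
        saver.setFile(new File("farmers_markets_clean.arff"));
        saver.writeBatch();
    }
}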

Based on the structure of the data set and the type of database, only certain algorithms yield results that interpret the data well.
Classification Algorithms:

We used the same database for the data mining project and the data warehousing project. The database is very vast and distributed, with many independent and few dependent attributes. After analyzing the database, we came to the conclusion that we should apply the data mining algorithms to different sets of attributes from the database in order to interpret the data well.
The two broad attribute sets formed for the data mining project are:
1. Goods Prediction and Clustering:
Location + Season Information + Goods Available
Basic Classification Histogram
In the above diagram we can select different goods from the class list and visualize the distribution of the selected good across all states or seasons.
Red: the particular good is available
Blue: the particular good is not available
2. Nutrition Program Prediction and Clustering:
Location + Season Information + Nutrition Programs
For the nutrition programs we find out which program is available at which market location and during which season.

Red: the particular nutrition program is available
Blue: the particular nutrition program is not available
Logistic Algorithm:

Logistic regression is a highly regarded classical statistical technique for making predictions.

The logistic algorithm assigns weights to the attributes in the data set and uses the logistic regression formula to predict how accurately a particular attribute value can be determined for future instances. Using related (interdependent) attributes therefore increases prediction capability, as opposed to using all the available data, since including independent attributes would distort the weight assignment on which the prediction accuracy is based. To apply logistic classification to the goods data, a set of relevant (i.e. dependent) attributes is used. The logistic algorithm assigns weights to all attributes in this dataset, and these weights are run through the logistic regression formula to predict the attribute under consideration, in this example 'Wine'.
Logistic Algorithm for class Wine

Thus, from the above diagram we interpret that the logistic classification algorithm can predict the next/future instance of 'Wine' with 88.8% accuracy, given the dependent relations among all the attributes used in this example (location + season + all goods).
Similarly, for the nutrition programs we use the 'location + season + nutrition program' dataset and predict the accuracy for SFNMP in the following example: the algorithm can predict a future instance of SFNMP with 83.4% accuracy.
Logistic Algorithm for class SFNMP

Logistic Algorithm for class WICcash
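
As a sketch of how such accuracy figures can be reproduced, the snippet below runs Weka's Logistic classifier under 10-fold cross-validation. The file name and the class attribute 'Wine' mirror the example above, but the report does not state the exact evaluation protocol, so this is illustrative only.

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.functions.Logistic;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class LogisticDemo {
    public static void main(String[] args) throws Exception {
        // Load the goods dataset (hypothetical file name).
        Instances data = DataSource.read("goods.arff");
        // Predict the 'Wine' attribute, as in the example above.
        data.setClass(data.attribute("Wine"));

        Logistic model = new Logistic();
        // 10-fold cross-validation is an assumed protocol.
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(model, data, 10, new Random(1));
        System.out.printf("Correctly classified: %.1f%%%n", eval.pctCorrect());
    }
}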


J48 Algorithm (Decision Tree):
The logistic algorithm cannot predict numeric values, whereas the J48 algorithm handles both nominal and numeric attribute values when building its tree.
J48 picks the most relevant attributes from the dataset to determine the prediction values, so it is better to give it all the attributes rather than only the relevant ones, as we did for the logistic algorithm. Using the whole data set with J48 increases prediction efficiency.
J48 visualizes its result in the form of a decision tree, in which the most relevant attributes are used to predict a particular attribute's future-instance value. Rules can be formed from this tree.

J48 Algorithm on Bake-goods

From the above diagram, 'Bake-goods' can be predicted with 94% accuracy using the attribute 'Vegetables', which J48 determined to be the most relevant.
Decision Tree for Bake-goods

The attribute 'Vegetables' is not used alone to predict 'Bake-goods'; other relevant attributes such as 'Prepared' and 'Soap' are used as well.
The rules that can be formed from the above decision tree are:
1. If Vegetables=Yes then Bake-goods=Yes
2. If Vegetables=No and Prepared=Yes then Bake-goods=Yes
3. If Vegetables=No and Prepared=No and Soap=Yes then Bake-goods=Yes
4. If Vegetables=No and Prepared=No and Soap=No then Bake-goods=No

Decision Tree for Herbs class
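
The tree and the rules above can be generated with a few lines against Weka's API. The sketch below is illustrative; the file name is assumed, and the class attribute name is taken from the report and assumed to match the dataset's column.

import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class J48Demo {
    public static void main(String[] args) throws Exception {
        // Load the goods dataset (hypothetical file name).
        Instances data = DataSource.read("goods.arff");
        // Class attribute named as in the report's example.
        data.setClass(data.attribute("Bake-goods"));

        J48 tree = new J48();
        tree.buildClassifier(data);
        // Printing the model shows the decision tree from which
        // the if-then rules above are read off.
        System.out.println(tree);
    }
}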


Data Mart
 The data mart is implemented on a star schema.
 The data mart provides the user with the market name, address, available goods and nutrition programs, and season details, on the basis of the attributes below:
 State
 City
 Goods
 Nutrition Program
 Season Duration
 Location Type
Star Schema

The star schema is the simplest style of data mart schema and is the approach most widely used to develop data warehouses and dimensional data marts. A star schema consists of one or more fact tables referencing any number of dimension tables. It is an important special case of the snowflake schema and is more effective for handling simpler queries.
The star schema gets its name from the physical model's resemblance to a star shape, with a fact table at its center and the dimension tables surrounding it representing the star's points. The star schema separates business process data into facts, which hold the measurable, quantitative data about a business, and dimensions, which are descriptive attributes related to the fact data. Examples of fact data include sales price, sale quantity, and time, distance, speed, and weight measurements. Related dimension attribute examples include product models, product colors, product sizes, geographic locations, and salesperson names.
A star schema that has many dimensions is sometimes called a centipede schema. Having dimensions of only a few attributes, while simpler to maintain, results in queries with many table joins and makes the star schema less easy to use.
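
To make the concept concrete for this project, here is a hypothetical sketch of the data mart's star schema written as plain Java records; the table and field names are assumptions, since the report does not spell out the exact schema.

// Dimension tables: descriptive attributes.
record LocationDim(int locationId, String state, String city, String locationType) {}
record SeasonDim(int seasonId, String seasonStart, String seasonEnd) {}
record GoodsDim(int goodsId, String goodsName) {}
record NutritionDim(int programId, String programName) {}

// Fact table at the center of the star: one row per market offering,
// holding foreign keys into each dimension.
record MarketFact(String marketName, String address,
                  int locationId, int seasonId, int goodsId, int programId) {}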
Project Schedule

January-
Learning about data mining and its capabilities

February-
Defining the problem statement and understanding the application domain

March-
Gathering, preparing, and exploring the dataset.
Choosing suitable data mining methods.

April-
Building and deploying the data analysis model.

May-
Consolidating the results and generating the report.

Work Division

Pawan:-
Formulating the problem hypothesis.
Gathering and preprocessing the data.

Irfan:-
Building and deploying the model for data mining.

Kamal:-
Interpreting the mined patterns and consolidating the final results.

Future Scope
 Privileged users will be able to insert new records.
 Integrate Google Maps for locations and directions.
 Develop a mobile application.
 Apply UI validations and filtering options on the data.

Learning Experience
 Analytical processing
 Learned different data mining tools such as Weka and RapidMiner
 Learned about real-world applications of different data mining algorithms
References:
 Farmers Markets Geographic Data. http://catalog.data.gov/dataset/farmers-markets-geographic-data

 Season & Crop Report 2012: estimates of the productivity and production of various crops, based on crop cutting experiments carried out jointly by the Departments of Economics and Statistics, Agriculture, and Horticulture and Plantation Crops. Retrieved from http://agritech.tnau.ac.in/pdf/2012/Season%20&%20Crop%20Report%202012.pdf

 Cunningham, S. J. and Holmes, G., 1999. Developing Innovative Applications in Agriculture Using Data Mining. Department of Computer Science, University of Waikato, Hamilton, New Zealand.

 Agriculture Crop Pattern Using Data Mining. IJARCSSE. https://www.ijarcsse.com/docs/papers/Volume_6/5_May2016/V6I5-0245.pdf
