COMP 4710 Assignment 1 Clustering

Total Marks: [30]


For this assignment, you are provided with data corresponding to accelerometer measurements from a
cell phone during different activities (walking, standing, etc.).
The data that is provided is as follows:
phonedata.txt: Measurements from accelerometer sensors on a phone during different activities.
activitylabels.txt: Labels indicating the activity being performed. Note that typically for
clustering, no labels are provided. In this case, the labels are provided to help understand
the strengths and limitations of clustering.
Your goal in this assignment is to implement and apply clustering algorithms to discover structure in the
phone data. Your implementation should be written in Python. Individual tasks and corresponding
marks follow:
1. Read the data [4 Marks]
Tasks
i. Write a function to read the data that is provided into matrices.
ii. Read the accelerometer measurement data and activity labels into matrix variables
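
As a rough illustration of task 1, the sketch below loads a text file into a NumPy matrix. The handout does not describe the file layout, so this assumes whitespace-delimited numeric values with one sample per line, and integer activity codes in the labels file. The function name read_matrix is hypothetical, and the demo uses in-memory text in place of the real files.

```python
import io
import numpy as np

def read_matrix(source, dtype=float):
    """Load whitespace-delimited numeric text into a 2D NumPy array.

    `source` may be a file path or an open file-like object.
    ASSUMPTION: one sample per line, values separated by whitespace.
    """
    # ndmin=2 keeps a single-column file (such as labels) as an (n, 1) matrix.
    return np.loadtxt(source, dtype=dtype, ndmin=2)

# Demo with in-memory text standing in for the real files.
demo_X = read_matrix(io.StringIO("0.1 0.2 0.3\n0.4 0.5 0.6\n"))
demo_y = read_matrix(io.StringIO("1\n2\n"), dtype=int)
```

With the real data, the calls would look like read_matrix("phonedata.txt") and read_matrix("activitylabels.txt", dtype=int), provided the files match the assumed layout.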
2. Examine the first 2 principal components associated with the data [4 Marks]
Each measurement consists of more than 500 dimensions (accelerometer measurements).
In this raw form, it's difficult to examine the structure of the data visually. One strategy that
can be applied is to consider a 2D projection of the data onto a plane. Using PCA, the
variance in the point locations on this plane is maximized.
Tasks
i. Apply PCA to the data (down to 2 dimensions)
ii. Visualize this as a 2D scatter plot
iii. For clustering, data is typically unlabeled. However, as described above, you are
given labels here. Color the observations in the scatter plot according to the motion type
(walking, standing, etc.) to see their relative positions.
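
One possible way to approach task 2 is sketched below: PCA computed from the singular value decomposition of the centered data, followed by a scatter plot colored by label. The names pca_2d and plot_projection and the output file name are hypothetical, and matplotlib is assumed to be installed for the plotting part. A library PCA routine would serve equally well.

```python
import numpy as np

def pca_2d(X):
    """Project the rows of X onto their first two principal components."""
    Xc = X - X.mean(axis=0)  # center each feature
    # Rows of Vt are the principal directions, ordered by decreasing variance.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:2].T

def plot_projection(Z, labels, out_path="pca_scatter.png"):
    """Save a scatter plot of the 2D projection, one color per label value."""
    import matplotlib.pyplot as plt  # imported here so pca_2d works without it

    labels = np.asarray(labels).ravel()
    fig, ax = plt.subplots()
    for lab in np.unique(labels):
        mask = labels == lab
        ax.scatter(Z[mask, 0], Z[mask, 1], s=8, label=str(lab))
    ax.set_xlabel("PC 1")
    ax.set_ylabel("PC 2")
    ax.legend(title="activity")
    fig.savefig(out_path, dpi=150)
    plt.close(fig)
```

Plotting each label value in its own scatter call lets matplotlib assign a distinct color per activity and build the legend automatically.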
3. Implement K-Means [14 Marks]
In this step you will implement the k-means algorithm. Your code should use the data from
step 1, and assign a cluster to each data sample according to k-means. You'll want to allow
both k and the number of iterations to be variables that can be altered in your code.

Tasks
i. Implement K-means clustering
ii. Compare the clusters produced for different values of k (e.g. 2, 6, 12)
iii. Output the sum of squared distances (SSE) for each cluster, and an overall score
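
The tasks above could be sketched roughly as follows. This is a minimal version of the standard (Lloyd's) k-means iteration, not a reference solution: the function names, the optional init_centers argument, the random default initialization, and the choice to leave an empty cluster's center where it was are all assumptions made for illustration.

```python
import numpy as np

def squared_distances(X, centers):
    """Return an (n, k) matrix of squared Euclidean distances."""
    # One center at a time keeps memory low for high-dimensional data.
    return np.stack([((X - c) ** 2).sum(axis=1) for c in centers], axis=1)

def kmeans(X, k, n_iters=100, init_centers=None, seed=0):
    """Cluster the rows of X; return (assignments, centers, per-cluster SSE, total SSE)."""
    rng = np.random.default_rng(seed)
    if init_centers is None:
        # Default: k distinct samples chosen uniformly at random.
        centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    else:
        centers = np.array(init_centers, dtype=float)

    for _ in range(n_iters):
        assign = squared_distances(X, centers).argmin(axis=1)
        new_centers = centers.copy()
        for j in range(k):
            members = X[assign == j]
            if len(members) > 0:  # an empty cluster keeps its old center
                new_centers[j] = members.mean(axis=0)
        if np.allclose(new_centers, centers):
            break  # centers stopped moving
        centers = new_centers

    # Final assignment and SSE with respect to the final centers.
    d2 = squared_distances(X, centers)
    assign = d2.argmin(axis=1)
    sse = np.array([d2[assign == j, j].sum() for j in range(k)])
    return assign, centers, sse, sse.sum()
```

Calling kmeans(X, k) for k in (2, 6, 12) and printing the per-cluster and total SSE gives the comparison asked for in 3.ii and 3.iii. Note that total SSE generally decreases as k grows, so it is most useful for comparing runs that share the same k.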
4. Initialization [4 Marks]
We have seen in class that the behavior of k-means can depend on the initialization
stage, i.e., where the initial cluster centers are placed.
Tasks
i. Include a mechanism for initializing the cluster centers that adds stability to the
resulting cluster assignments. You should justify that this works based on the output
of part 3.iii.
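
The handout does not prescribe a particular mechanism. One widely used option, shown here purely as an example, is k-means++ seeding: the first center is a random sample, and each later center is drawn with probability proportional to its squared distance from the nearest center chosen so far. The function name kmeans_pp_init is hypothetical.

```python
import numpy as np

def kmeans_pp_init(X, k, seed=0):
    """Pick k initial centers from the rows of X using k-means++ seeding."""
    rng = np.random.default_rng(seed)
    n = len(X)
    first = rng.integers(n)
    centers = [X[first]]
    # d2[i] = squared distance from sample i to its nearest chosen center.
    d2 = ((X - X[first]) ** 2).sum(axis=1)
    for _ in range(1, k):
        total = d2.sum()
        if total == 0:
            idx = rng.integers(n)  # every sample coincides with a chosen center
        else:
            idx = rng.choice(n, p=d2 / total)  # far-away samples are favored
        centers.append(X[idx])
        d2 = np.minimum(d2, ((X - X[idx]) ** 2).sum(axis=1))
    return np.array(centers, dtype=float)
```

To support the justification the task asks for, one could run k-means over several seeds with random and with k-means++ starting centers and compare the spread of total SSE from 3.iii. A narrower spread would suggest more stable results. Another simple alternative is to keep the lowest-SSE run out of several random restarts.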

5. Analysis [4 Marks]
Tasks
i. Comment on how well the data is separated into distinct clusters (remember also that
you don't typically have labels when clustering).
ii. Samples in this data are sequential, but are being treated as independent
observations. How might this knowledge be included to produce a different result from
clustering?
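
As one illustration of what using the sequential structure could mean in 5.ii (an assumption for the sake of example, not the expected answer), each sample could be replaced by the average of its temporal neighbors before clustering, so that consecutive samples are pulled toward the same cluster. The function name smooth_over_time and the window size are hypothetical, and the rows of X are assumed to be in time order.

```python
import numpy as np

def smooth_over_time(X, window=5):
    """Replace each row with the mean of the rows in a centered time window.

    Windows are truncated at the start and end of the sequence.
    """
    n = len(X)
    half = window // 2
    # Prefix sums: csum[i] holds the sum of rows 0..i-1.
    csum = np.vstack([np.zeros((1, X.shape[1])), np.cumsum(X, axis=0)])
    lo = np.clip(np.arange(n) - half, 0, n)
    hi = np.clip(np.arange(n) + half + 1, 0, n)
    return (csum[hi] - csum[lo]) / (hi - lo)[:, None]
```

Other possibilities include appending features from neighboring samples to each row, or smoothing the cluster assignments after clustering.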

TO HAND IN
You should hand in a .zip file named yourlastname_yourfirstname_A1.zip to the D2L dropbox.
This .zip should contain the following:
1. A Python (.py) file named myKMeans.py that implements the steps outlined above.
2. A pdf file (yourlastname_yourfirstname.pdf) that includes the following:
a. The scatter plot produced in 2.ii and 2.iii (1 plot is fine).
b. 1 paragraph describing your solution to part 4. (If not implemented, say so here.)
c. 1 paragraph with your comments on 5.i.
d. 1 paragraph with your comments on 5.ii.
