
STSCI4060-HW5

April 24, 2016

STSCI 4060 HW5: scipy, numpy, matplotlib, etc.

In [2]: %matplotlib inline
%config InlineBackend.figure_format = 'svg'
from __future__ import division

1.1 Q1: Linear regression

We all know that in linear regression, the model is set up to be $Y = X\beta + \epsilon$, where $Y$ is the dependent
variable, $X$ the data/design matrix of independent variables, $\beta$ the parameter (a vector), and $\epsilon$ the noise
random variable.
1.1.1

Let our model simply be $Y = \beta_0 + \beta_1 X + \epsilon$ ($X$ is now a scalar). We have three data points (1,4), (2,3)
and (3,10), as in the figure below. First find the best-fitting line through the three points, and plot the line.
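One way to obtain the fit is an ordinary least-squares solve on the design matrix. The sketch below is only an illustration, not the intended solution: it assumes NumPy is available, and the variable names (`X`, `beta0`, `beta1`, `y_fit`) are chosen here for convenience.

```python
import numpy as np

# The three data points given in the exercise.
x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 3.0, 10.0])

# Design matrix: a column of ones (intercept) next to x (slope).
X = np.column_stack([np.ones_like(x), x])

# Least-squares solution of X.dot(beta) ~ y.
beta, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
beta0, beta1 = beta

# Fitted values on the three x positions.
y_fit = X.dot(beta)
print(beta0)
print(beta1)
```

The fitted line can then be drawn with `plt.plot(x, y_fit)` on top of a scatter plot of the points.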
In [4]:

1.1.2

You probably have also seen the following geometric interpretation of linear regression: the design matrix $X$
spans a hyperplane and the best fit is chosen on the plane that minimizes its distance to the data $Y$, which
means $X\hat{\beta}$ (the best fit) is perpendicular to $Y - X\hat{\beta}$.
I have plotted out the plane and projection for you, using the three points above: the red line segment
represents $Y$, the green line segment $X\hat{\beta}$ and the magenta line segment $Y - X\hat{\beta}$;
$Y - X\hat{\beta}$ should be perpendicular to $X\hat{\beta}$.
Now can you numerically verify that the two vectors are indeed perpendicular?
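A numerical check of perpendicularity amounts to confirming that an inner product is zero up to floating-point error. The following is a minimal sketch under the assumption that the fit is computed from the normal equations; the names `beta_hat`, `residual` and `inner` are illustrative.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 3.0, 10.0])
X = np.column_stack([np.ones_like(x), x])

# Normal equations: (X^T X) beta_hat = X^T y.
beta_hat = np.linalg.solve(X.T.dot(X), X.T.dot(y))

y_fit = X.dot(beta_hat)   # X beta_hat, the projection of y onto the plane
residual = y - y_fit      # Y - X beta_hat

# Perpendicular vectors have an inner product of zero (up to rounding).
inner = np.dot(y_fit, residual)
print(np.isclose(inner, 0.0))

# Stronger statement: the residual is orthogonal to every column of X.
print(np.allclose(X.T.dot(residual), 0.0))
```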

In [ ]:
1.1.3

Linear regression is the solution to an optimization problem. For a general optimization problem, one can
plot out the contours of the objective function to be optimized (like this one; in our case, the objective
function is the sum-of-squares): it is very helpful in designing algorithms to search for the optima. Can you
plot out the contour in the space of $\beta_0$ and $\beta_1$ of our linear regression problem?
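The objective can be evaluated on a grid of candidate parameters before any plotting. The sketch below assumes NumPy and, for the optional plot, matplotlib; the grid ranges and the helper name `plot_sse_contours` are arbitrary choices made for illustration.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 3.0, 10.0])

# Grid of candidate (beta0, beta1) values; the ranges are an assumption.
beta0s = np.linspace(-6.0, 6.0, 121)
beta1s = np.linspace(-2.0, 8.0, 101)
B0, B1 = np.meshgrid(beta0s, beta1s)

# Predictions for every grid point, shape (101, 121, 3), then the
# sum of squared residuals over the three data points.
pred = B0[..., None] + B1[..., None] * x
sse = ((y - pred) ** 2).sum(axis=-1)


def plot_sse_contours(B0, B1, sse, n_levels=30):
    """Draw contour lines of the sum of squares (hypothetical helper)."""
    import matplotlib.pyplot as plt
    fig, ax = plt.subplots()
    ax.contour(B0, B1, sse, n_levels)
    ax.set_xlabel("beta0")
    ax.set_ylabel("beta1")
    return fig, ax


# Grid point with the smallest sum of squares.
i, j = np.unravel_index(sse.argmin(), sse.shape)
print(B0[i, j])
print(B1[i, j])
```

Calling `plot_sse_contours(B0, B1, sse)` in a notebook cell draws the contours; the grid minimum should sit close to the least-squares estimate.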
In [ ]:
1.1.4

Bonus question (5 pts): Can you make a 3D surface plot of the contour? That is, the graph of sum of squares
as a function of $\beta_0$ and $\beta_1$.
In [ ]:
1.1.5 Appendix: the code for plotting the plane in 3D

You need to disable the %matplotlib inline flag to have a pop-up window of the 3D plot and then rotate
it to get the right angle.
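In [8]: # Assumes x, y, y_fit, beta0s, beta0ss and beta1ss were defined in earlier cells.
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # registers the '3d' projection on older matplotlib

fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
# Each (beta0, beta1) pair maps to the point beta0*[1,1,1] + beta1*x in the column space of X.
xyzss = np.array([np.array([1, 1, 1]) * b0 + x * b1
                  for b0, b1 in zip(beta0ss.flatten(), beta1ss.flatten())])
n = len(beta0s)
xss = xyzss[:, 0].reshape(n, n)
yss = xyzss[:, 1].reshape(n, n)
zss = xyzss[:, 2].reshape(n, n)
ax.plot_surface(xss, yss, zss, alpha=0.2, cstride=100, rstride=100)
ax.plot(*zip([0, 0, 0], y), color='r')      # red: Y
ax.plot(*zip([0, 0, 0], y_fit), color='g')  # green: the best fit
ax.plot(*zip(y, y_fit), color='m')          # magenta: the residual
plt.show()
plt.close()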

1.2 Q2: Clustering Iris

K-means clustering is one of the most popular data mining algorithms and should be in every data scientist's
toolbox. In this exercise, we will apply the algorithm to the classic dataset that goes back to our hero R.
A. Fisher. The dataset contains quantitative measurements of some traits of three flower species, and the
question is whether we can tell the flower species apart by just analyzing the measurements (that is, without
years of training in plant taxonomy).
1.2.1

Download and process the data from
http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data. You can either download the file to
your computer and use the function open, or use Python's built-in module urllib2.
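The file is plain comma-separated text: four numeric columns followed by the species name. The sketch below shows one possible way to parse it; it is written for Python 3, where `urllib2.urlopen` lives at `urllib.request.urlopen`. The helper names `parse_iris` and `download_iris` are made up for this example, and the three sample lines are rows copied from the dataset so that the parser can be tried without a network connection.

```python
import numpy as np

IRIS_URL = ("http://archive.ics.uci.edu/ml/"
            "machine-learning-databases/iris/iris.data")


def parse_iris(text):
    """Split iris.data text into a float feature array and a label array."""
    features, labels = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:              # skip blank lines, e.g. at the end of the file
            continue
        fields = line.split(",")
        features.append([float(v) for v in fields[:4]])
        labels.append(fields[4])
    return np.array(features), np.array(labels)


def download_iris(url=IRIS_URL):
    """Fetch the raw text (Python 3 counterpart of urllib2.urlopen)."""
    from urllib.request import urlopen
    with urlopen(url) as response:
        return response.read().decode("utf-8")


# Three rows from the dataset, plus a trailing blank line, to exercise the parser.
sample = """5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
7.0,3.2,4.7,1.4,Iris-versicolor

"""
features, labels = parse_iris(sample)
print(features.shape)
print(labels)
```

With network access, `features, labels = parse_iris(download_iris())` would load the full dataset; a local copy can be read with `open(path).read()` instead.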
In [9]:
1.2.2

As the Wikipedia page of the dataset explains, the four columns correspond to four features of a flower:
sepal length, sepal width, petal length and petal width. So one flower is represented by a point in the
four-dimensional space. For our purpose of demonstrating the k-means clustering algorithm though, it would be
nice if the feature space were two-dimensional, so that we can plot the points on a plane. So we are going
to use only part of the dataset.
There is some freedom in reducing a four-dimensional data point to a two-dimensional one: we can
use only the sepal data, or only the petal data, among other choices. In fact this choice can be important!
Suppose petal is a bad feature for classifying different flower species: they tend to share the same petal length
and width. Then keeping petal data in the dataset would likely worsen the performance of the clustering
algorithm. You get the idea: we want to use features that are most informative, i.e., different between species.
I could have let you try out a bunch of choices and see which choice of two features is most informative.
But for now let's keep the task simple: we want to see how good a choice the sepal area and petal area are,
each defined as the product of length and width.

Task: Apply the k-means clustering algorithm to the dataset using sepal and petal areas. SciPy and
scikit-learn each provide an implementation of the algorithm.
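As a rough illustration of the feature construction and the clustering call, the sketch below uses `scipy.cluster.vq.kmeans2` on six rows copied from the dataset (two per species). The choice of `kmeans2` with `minit="points"` is an assumption made for this example; scikit-learn's `KMeans` would serve equally well, and the real task uses all 150 rows.

```python
import numpy as np
from scipy.cluster.vq import kmeans2

# Six rows from the dataset; columns are sepal length, sepal width,
# petal length, petal width.
features = np.array([
    [5.1, 3.5, 1.4, 0.2],   # Iris-setosa
    [4.9, 3.0, 1.4, 0.2],   # Iris-setosa
    [7.0, 3.2, 4.7, 1.4],   # Iris-versicolor
    [6.4, 3.2, 4.5, 1.5],   # Iris-versicolor
    [6.3, 3.3, 6.0, 2.5],   # Iris-virginica
    [5.8, 2.7, 5.1, 1.9],   # Iris-virginica
])

# Area = length * width, for sepals and petals separately.
sepal_area = features[:, 0] * features[:, 1]
petal_area = features[:, 2] * features[:, 3]
areas = np.column_stack([sepal_area, petal_area])

# k = 3 clusters; initial centroids are drawn at random from the data points.
centroids, cluster_ids = kmeans2(areas, 3, minit="points")
print(centroids)
print(cluster_ids)
```

The integer cluster ids are arbitrary: cluster 0 need not correspond to any particular species, which matters when comparing them with the true labels.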
In [10]:
1.2.3

Plot the results: create two subplots, one with true labels and the other with labels returned by the k-means
algorithm; in each subplot, scatter plot the 2-d features with label information color-coded (that is, different
colors correspond to different labels). I have created an example plot using features of petal length and
width below (the stars are the centroids of the clusters).
In [13]:

1.2.4

Run the k-means clustering algorithm again and plot the results. Are they any different from the previous
one? Why is that?
In [ ]:

1.3 Q3: Mysterious distribution

The Earth's most powerful radio telescope recently intercepted some signal, apparently from a far-away
galaxy and probably from an intelligent civilization. The signal, in numeric form, looks like the following.
We ought to make sense of it: the fate of our whole civilization may depend on it. As statisticians, you are
given the following task: do the numbers follow some distribution?
To help you present your findings, you need to achieve the following data visualization:
- Plot the histogram of the data.
- Try a variety of distributions in scipy for fitting the data and visualize the fit along with the histogram.
An example of such a visualization is presented below, using a different set of data from another region
of the universe.

Tip 1: A message just came from the Earth's Defense Council: "Dear Earth-saving Statisticians:
Another signal was just intercepted and may or may not help you decipher the previous signal; it reads
'HW3Q2'."
Tip 2: You can create as many plots as you want to explore the distributions in scipy.
Tip 3: To help examine the fit, you can zoom in by calling ax.set_xlim on the figure's axes.
Finally, conclude which distribution the data likely comes from.
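The general workflow of fitting several candidate distributions and comparing them against a histogram can be sketched as follows. The data here is synthetic (drawn from a gamma distribution) and is not the intercepted signal; the candidate list, the use of the Kolmogorov-Smirnov statistic as a rough goodness-of-fit score, and the names `fits` and `plot_fits` are all assumptions made for illustration.

```python
import numpy as np
from scipy import stats

# Synthetic stand-in data, NOT the signal from the exercise.
data = stats.gamma.rvs(a=2.0, size=500, random_state=0)

# Candidate distributions, referred to by their scipy.stats names.
candidates = ["gamma", "lognorm", "expon"]

fits = {}
for name in candidates:
    dist = getattr(stats, name)
    params = dist.fit(data)   # maximum-likelihood shape/loc/scale
    ks_stat, p_value = stats.kstest(data, name, args=params)
    fits[name] = {"params": params, "ks": ks_stat, "p": p_value}


def plot_fits(data, fits, bins=40):
    """Overlay each fitted pdf on a normalized histogram (hypothetical helper)."""
    import matplotlib.pyplot as plt
    fig, ax = plt.subplots()
    ax.hist(data, bins=bins, density=True, alpha=0.4, label="data")
    grid = np.linspace(data.min(), data.max(), 400)
    for name, fit in fits.items():
        pdf = getattr(stats, name).pdf(grid, *fit["params"])
        ax.plot(grid, pdf, label="%s (KS=%.3f)" % (name, fit["ks"]))
    ax.legend()
    return fig, ax


# Smaller KS statistic means the fitted cdf tracks the empirical cdf more closely.
for name in sorted(fits, key=lambda n: fits[n]["ks"]):
    print(name)
```

Calling `fig, ax = plot_fits(data, fits)` produces the overlay, and `ax.set_xlim(...)` can then be used to zoom in on a region of interest.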
In [32]: signal
Out[32]: array([ 1.44453212,
1.31029573,
1.13902874,
1.21499325,
1.07685939,
1.16959163,
1.01877994,
1.14315296,
1.54498746,
1.07501138,
1.79090731,
1.41843107,
1.84466076,
1.08570246,
1.93557598,
2.48311135,
1.50190593,
1.17280419,
1.47325052,
2.21940257,
1.11925996,
1.32364237,
1.40234143,
1.20502685,
2.93439766,
1.04507486,
1.28666848,
1.14346755,
1.3785449 ,
1.67940632,
1.25431545,
1.46565732,
1.15716268,
1.22898523,
1.49406136,
1.03084654,
1.09338998,
1.01105574,
1.27567015,
1.09969574,
1.41425302,
1.05519743,
1.52779114,
1.12584717,
1.22893581,

1.18258002,
1.5952089 ,
1.12016579,
1.05489324,
1.79568456,
2.30840447,
1.12818187,
1.37209311,
1.13015779,
1.05983836,
1.23922153,
1.12086361,
1.33057553,
1.75185286,
1.01912749,
1.00690397,
1.09487381,
1.04843452,
1.15575288,
2.38249018,
1.24432807,
1.22840459,
1.32875911,
1.37657142,
1.12953569,
1.89915929,
1.31775002,
1.62004183,
1.07139888,
1.44098683,
1.300809 ,
1.14248455,
1.16287548,
1.18342923,
1.46457628,
1.30211605,
1.35216939,
2.11671164,
1.8243831 ,
2.28423817,
1.02166456,
1.16565928,
1.76333605,
1.14428574,
1.15000807,

1.03805217,
1.65411452,
1.32608345,
1.10231749,
1.03640403,
3.14355313,
1.11233639,
1.0066668 ,
1.00990086,
1.87301605,
1.42016309,
1.39185735,
1.14603421,
1.56981034,
1.08129979,
1.55751516,
1.04217224,
1.20721447,
1.4199131 ,
1.2787506 ,
1.09014317,
1.44669374,
1.1903764 ,
1.14025011,
1.01433269,
1.00040325,
1.50565365,
1.24481398,
1.36404892,
1.01151223,
1.57419071,
1.52089463,
1.11840218,
1.83057876,
1.07788144,
1.01585276,
1.1287503 ,
1.60810814,
1.45497029,
1.09114628,
2.14374467,
1.43272015,
1.05078056,
1.2812132 ,
2.35666284,

1.28821731,
2.04662766,
3.22552144,
1.84081581,
1.17904747,
1.0328244 ,
1.06165922,
1.06904307,
1.48663454,
1.27358199,
1.56024695,
1.03406049,
1.04394476,
1.01324134,
1.02580712,
1.01597443,
1.05702755,
1.49912473,
1.00133827,
4.11952418,
1.04116853,
2.14914062,
1.03822263,
1.14132957,
1.41002729,
1.71601967,
1.08036611,
1.6159459 ,
1.1035424 ,
1.29900479,
1.16892862,
1.12422147,
1.01901151,
1.65201986,
1.62379881,
1.0941412 ,
1.2560944 ,
1.22746707,
1.34461117,
1.02924567,
1.96345001,
1.23355934,
1.28866482,
1.01756206,
1.00451665,

1.13498651,
1.28437529,
1.21360913,
2.60131908,
1.33886448,
1.0247849 ,
1.32419206,
1.21639181,
1.03019391,
1.50987017,
1.14505021,
1.63893923,
1.08332697,
1.20602076,
1.3441864 ,
1.07798696,
1.06766593,
1.21099451,
1.64895644,
2.44680645,
1.15459539,
1.01324296,
1.04142651,
1.05485122,
1.02221429,
1.03895163,
1.08291891,
1.07471603,
1.18653354,
1.50316092,
1.04766515,
1.46064678,
1.21986849,
1.64264932,
1.23631498,
1.24179794,
1.03037496,
1.05268564,
1.25263157,
1.33868055,
1.01076355,
1.28724961,
1.4574198 ,
1.24993151,
1.01586006,

1.03209185,
1.03165413,
1.65651277,
2.03485676,
1.47056952,
1.17867936,
2.59135128,
1.24603878,
1.18865189,
1.0293775 ,
1.15394333,
1.18475302,
1.24209899,
2.93501019,
1.5341392 ,
1.42395027,
1.22760147,
1.34082793,
1.53271542,
1.01617757,
1.12952316,
1.19639675,
1.07955424,
1.17168258,
1.27305039,
1.0217836 ,
1.25571998,
2.44719721,
1.2555586 ,
1.04662299,
1.19984274,
1.03309516,
1.06237822,
1.51049757,
1.03256088,
1.18022364,
1.27580795,
1.61109348,
1.17293644,
1.56594377,
1.06633337,
1.07410132,
1.31333614,
1.02686182,
1.59426373,
1.07059047,
1.04875618,
1.28572718,
1.49484075,
1.0669342 ,
1.01558494,
1.01339323,
2.58213476,
1.16209374,

1.01501507,
1.00421736,
1.27457409,
1.24891475,
1.10871342,
5.46029901,
1.1044311 ,
1.00690538,
1.00962163,
1.51887637,
1.09577528,
1.02546523,
1.08075255,
2.84750028,
1.55527989,
1.22698314,
1.22791248,
1.17130741,
1.01933396,
1.38299598,
1.00225794,
1.10408172,
1.0411682 ,
1.01479641,
1.05019362,
1.52672752,
1.12122971,
1.1526257 ,
1.02402354,
1.01621595,
1.22912968,
1.54358701,
1.37362019,
1.57591102,
1.7564268 ,
1.43823141,
3.41605249,
1.63924406,
1.33843962,
2.00367003,
2.52764258,
1.39873728,
1.96754908,
2.66743181,
1.76221469,
1.02360085,
1.95251048,
2.00756361,
1.05705728,
1.21611544,
1.0105134 ,
1.9084003 ,
1.14104331,
1.00242495,

1.08105887,
1.00478351,
1.46030544,
1.03753238,
1.03684134,
1.89872266,
1.16966667,
2.37765828,
1.07820165,
1.0363876 ,
1.61249966,
1.01867818,
1.3217345 ,
1.06550546,
2.44856734,
1.01357069,
1.22229178,
1.13065104,
1.32101685,
1.15703747,
1.24101006,
1.00410173,
1.03860604,
1.24795363,
1.07853033,
1.39305883,
1.01247142,
1.27089027,
1.01796652,
1.04998689,
1.00535462,
1.54936218,
1.01306165,
1.02022527,
1.47161263,
1.98235651,
1.3891879 ,
1.04776302,
1.85573092,
1.08005327,
1.80967039,
1.17615383,
1.27575065,
1.49988162,
1.13053833,
1.98714012,
1.09297224,
1.50534382,
1.37092179,
1.27875015,
1.06392521,
1.14199345,
1.12144287,
1.01495586,

1.0375189 ,
5.63202967,
1.54235761,
1.20053678,
1.05495356,
1.00779214,
1.09504863,
1.20228452,
1.1290459 ,
1.13305883,
1.14713077,
1.71465525,
1.15080532,
1.06139283,
1.46332288,
1.10238537,
1.43077928,
1.07304807,
1.32435977,
1.33079425,
1.21536671,
1.08807325,
1.03209221,
1.16536836,
1.03926949,
1.09628643,
1.02155374,
1.3734149 ,
1.10552344,
1.19854599,
1.15231909,
1.53813402,
1.2391745 ,
1.31031405,
1.07307378,
1.11088493,
1.50648831,
2.42919856,
1.31772002,
1.25465522,
2.23305935,
1.03103081,
1.74059899,
1.00286209,
1.97958926,
1.05543159,
1.06678494,
1.84392152,
1.28509684,
1.0383276 ,
2.65297843,
1.24155876,
1.01993451,
1.14763459,

1.18614703,
1.26039911,
1.02280989,
1.03142282,
1.47870784,
1.06337341,
1.36072914,
1.2998942 ,
1.64954777,
1.2443947 ,
1.07769587,
2.97794669,
1.52862688,
1.56688688,
1.70358137,
1.00058879,
2.44747457,
1.07532599,
1.37671972,
1.47887067,
1.0254581 ,
1.15892022,
1.1350999 ,
1.05817675,
1.04601681,
1.12274672,
1.11371042,
1.40568478,
1.08685349,
1.18777811,
1.04485695,
1.0082856 ,
1.10923518,
1.02146551,
1.209285 ,
1.48327823,
1.55524003,
1.90753122,
1.10790214,
1.08744892,
2.26987092,
1.00591014,
1.11216199,
1.39115214,
1.26176032,
1.24071326,
1.15682994,
1.00121979,
1.09791125,
1.02422354,
1.21367821,
1.10717761,
1.09259185,
1.70935546,

1.31387709,
1.09509002,
1.10520341,
1.00486334,
1.23911838,
1.93676678,
1.39089679,
1.43118545,
1.21863161,
1.41295476,
1.47815766,
1.00177134,
1.0614147 ,
1.14545512,
1.60070863,
1.27851209,
1.01560319,
1.01154308,
1.28680385,
1.00072029,
1.49140271,
1.44156718,
1.09346738,
1.39961941,
1.00382282,
1.04844115,
1.16038242,
1.15755387,
1.10234973,
1.01172563,
1.3658791 ,
1.2530971 ,
1.03461859,
1.08992631,
1.05112267,
1.22827996,
1.03614818,
1.07040393,
1.05512077,
1.47027123,
2.4575141 ,
1.21283806,
1.13924088,
1.39810622,
1.14728435,
1.05618988,
1.95035031,
1.2345179 ,
1.54390974,
1.28047352,
1.47926924,
1.19636815,
1.12669513,
1.0891727 ,

1.425331 ,
1.21638581,
1.01171722,
1.18134442,
1.10572272,
1.58353646,
1.23845997,
1.55825178,
1.18649518,
1.00724167,
1.47954415,
2.82874064,
1.17974741,
3.64254994,
1.11802302,
1.0885582 ,
1.1659031 ,
1.08793239,
1.11265149,
1.30138494,
1.20825654,
1.72271344,
1.46020062,
1.18317539,
1.29311886,
1.97132441,
1.14917275,
1.44006377,
1.41775556,
1.4330961 ,
1.01320724,
1.6657984 ,
1.08315082,
1.19131367,
1.02576337,
1.45951389,
1.10637239,
1.05910829,
2.17069366,
1.00220003,
1.11025746,
1.01368882,
1.11022654,
1.19301547,
2.16278657,
1.22295398,
1.37581792,
1.12573866,
1.31123082,
1.23240546,
1.04476784,
1.57735747,
1.02623638,
1.17133224,

1.08950402,
1.13351737,
1.71885653,
1.36394093,
1.23173362,
1.62059538,
1.15511487,
1.3249483 ,
2.98197871,
1.00352313,
1.04273363,
1.0258243 ,
1.06213563,
1.09910778,
2.38421288,
1.06364424,
1.32574193,
1.04523562,
1.32627283,
1.2729427 ,
1.07370304,
1.43219064,
1.21174856,
2.11408944,
1.424391 ,
1.034448 ,
1.40651287,
1.07048106,
1.04892445,
1.0194169 ,
1.14587868,
1.11830339,
1.26848889,
1.03305638,
1.38488803,
1.21446323,
1.09258428,
1.29679733,
1.16984553,
1.14555249,
1.39171526,
1.14484534,
1.06562834,
1.05734861,
1.16221177,
1.01863807,
1.59997181,
1.94424766,
1.6089754 ,
1.08021956,
1.01578032,
1.00639764,
1.06872586,
1.42452236,

1.28232391,
1.01140047,
1.12952698,
1.0216147 ,
1.17937227,
1.18831771,
1.0404595 ,
1.16293582,
1.32669866,
1.45902222,
1.01096984,
1.24661972,
1.05028958,
1.50451178,
1.67927592,
1.04414479,
1.94748268,
1.16269232,
1.1658267 ,
1.21741888,
1.15169193,
1.20513816,
1.16524324,
1.87092859,
1.19184708,
1.2816377 ,
1.1765528 ,
1.06635321,
1.22248377,
2.10440164,
1.01980165,
1.1078233 ,
1.44457731,
1.07423198,
1.13964961,
1.06678144,
1.11960838,
1.21231128,
1.20956943,
2.16051591,
1.99346342,
1.41869765,
1.48796921,
1.0251517 ,
1.0170617 ,
1.82228497,
1.16871896,
1.0377868 ,
1.01553597,
1.13483553,
1.43035108,
1.0231024 ,
1.01370443,
1.05418585,

1.05075217,
1.12936435,
1.048548 ,
1.12879849,
1.71828911,
1.17494036,
1.0614151 ,
1.22759895,
1.50417286,
1.09701677,
1.13677812,
1.04191608,
1.51540798,
1.04886102,
1.14008191,
1.10482801,
1.25285764,
1.00706923,
1.27926552,
1.63607723,
1.00581823,
1.05585011,
1.14656712,
1.62177278,
1.49166116,
2.11883346,
1.57903226,
1.00602118,
1.33830155,
1.36122223,
1.31439656,
2.25035605,
1.00878055,
2.41050275,
1.15391578,
1.26014077,
1.35152898,
1.15257201,
1.20464072,
2.95013602,
1.19997358,
1.37537023,
1.17623061,
2.26642566,
1.14846965,
1.16530899,
1.20091774,
4.15068131,
1.00240787,
1.31800684,
2.27112475,
1.06562435,
1.24875747,
1.15805445,

1.50540285,
1.5048435 ,
1.1238637 ,
1.30369065,
1.37532721,
1.63944714,
1.59342054,
1.10397704,
2.09530775,
1.00635476,
1.40257272,
1.20392303,
1.00376977,
1.29114942,
1.2719005 ,
1.3098617 ,
1.32193094,
1.01107827,
1.70660582,
1.38003164,
1.6351821 ,
1.30340664,
3.11030369,
1.20558642,
2.10382866,
1.17246329,
1.34051959,
1.48153744,
2.07002931,
1.54513521,
1.04207303,
1.12255889,
1.05513737,
1.21415415,
1.40126187,
1.54730459,
1.08565283,
1.09576236,
1.3295328 ,
1.00327452,
2.41822565,
1.05815056,
1.10657036,
1.01246734,
1.07475663,
1.06421576,
1.17378432,

1.05488569,
1.53888932,
1.25113308,
1.01331162,
1.10118553,
1.1676247 ,
4.99415332,
1.15337246,
1.326607 ,
1.40818503,
1.14394073,
1.22116946,
1.04547742,
1.68235567,
1.10916612,
1.36333065,
1.57072353,
1.35955945,
1.30518619,
1.00267156,
2.36810608,
1.16995286,
1.1642712 ,
1.10859618,
1.06600503,
1.21508987,
1.0787819 ,
1.09805371,
1.02292747,
1.03231072,
1.18328352,
1.37353851,
1.22336554,
3.59724846,
1.24665954,
1.30544803,
2.44246581,
1.32471874,
1.00738069,
5.9882054 ,
1.40249714,
1.00407856,
1.4949384 ,
1.0824369 ,
2.15483025,
1.18316962,
1.20407141,

1.20700795,
1.31698135,
1.64502595,
1.00362929,
1.17024091,
1.35518787,
1.37073607,
1.07195997,
1.07709749,
1.02678228,
1.07216194,
1.41003297,
1.15569263,
1.21970862,
1.23495698,
1.90894829,
1.07505574,
1.03737332,
4.47329132,
1.08618807,
1.80683867,
1.06374104,
1.57406483,
2.95074706,
1.43707771,
1.45700787,
1.06277976,
1.28950268,
1.02389288,
1.19593147,
1.54384734,
1.17012639,
1.22785342,
2.11297625,
1.0155479 ,
1.56983927,
1.21030124,
1.02038832,
1.1496182 ,
1.37102479,
1.18815789,
1.40055227,
1.10496447,
1.46958707,
1.14785043,
1.3990104 ,
1.00010907,

1.3322435 ,
1.06591709,
1.42761931,
1.06130336,
1.04202372,
1.37502073,
1.30210462,
1.59030075,
1.15990004,
1.22877023,
1.03240683,
1.38282255,
5.29128002,
1.64624387,
1.18049123,
1.45474912,
1.00783053,
1.09468375,
1.07750658,
1.42731209,
1.50480417,
1.29594315,
1.36520632,
1.22134086,
1.29495065,
1.2638086 ,
1.28686286,
1.097493 ,
1.25194402,
1.0254639 ,
1.14500442,
1.38391157,
1.27631816,
1.26143227,
1.28175494,
1.28222527,
2.80102026,
1.25297709,
1.0789617 ,
1.09900305,
1.04979786,
1.00692291,
1.09051848,
1.25495488,
1.22485768,
1.39048945,
1.43713945,

1.00780962,
1.08570245,
1.06296594,
1.5301247 ,
1.13714536,
1.12400697,
2.42457801,
1.72792049,
1.68745525,
1.1892978 ,
1.18567753,
1.77681173,
1.02524465,
2.0412072 ,
1.22801025,
1.8144427 ,
1.00250381,
1.15216324,
1.03342563,
1.2146333 ,
1.18301316,
1.04074624,
2.56763039,
1.01645654,
1.32412538,
1.44317947,
1.47100989,
1.12396306,
1.08270657,
1.09026155,
1.06594394,
1.32727307,
1.06380674,
1.12505204,
1.27414843,
1.32176709,
1.07149333,
2.29099447,
1.38541987,
1.25483826,
1.00763163,
2.3895882 ,
1.21207706,
1.26449497,
1.10709807,
1.18194765,
1.04910661])
