
Pattern Recognition and Machine Learning - Christopher M. Bishop
http://jmlr.csail.mit.edu/papers/volume2/manevitz01a/manevitz01a.pdf
http://www.dbis.ethz.ch/education/ss2007/07_dbs_algodbs/FreyReport.pdf
http://www.cc.gatech.edu/~ssahay/sauravsahay7001-2.pdf
http://www.cs.columbia.edu/~kathy/cs4701/documents/jason_svm_tutorial.pdf
http://code.google.com/p/textclassification/source/browse/trunk/src/java/libsvm/svm.java?r=13
http://code.google.com/p/textclassification/source/browse/wiki/HOWTO.wiki?r=13

concepts:
http://en.wikipedia.org/wiki/Bhattacharyya_distance
http://en.wikipedia.org/wiki/Slack_variable

manifold:
www.math.ucla.edu/~wittman/mani/ --> http://www.autonlab.org/tutorials/ (important)

SVM:
http://arxiv.org/PS_cache/arxiv/pdf/1107/1107.2347v1.pdf
http://en.wikibooks.org/wiki/Data_Mining_Algorithms_In_R/Classification/SVM
http://opencv.itseez.com/doc/tutorials/ml/introduction_to_svm/introduction_to_svm.html
http://www.cs.berkeley.edu/~bartlett/courses/281b-sp08/6.pdf
http://www.support-vector-machines.org/SVM_soft.html
http://research.cs.wisc.edu/dmi/lsvm/
http://www.statsoft.com/textbook/support-vector-machines/
http://www.support-vector.net/icml-tutorial.pdf
http://research.microsoft.com/en-us/um/people/oferd/papers/dekelsi07.pdf
http://www.cs.columbia.edu/~kathy/cs4701/documents/jason_svm_tutorial.pdf
https://sites.google.com/site/postechdm/research/implementation/svm-java
http://fedc.wiwi.hu-berlin.de/xplore/tutorials/stfhtmlnode64.html
http://jmlr.csail.mit.edu/papers/volume1/mangasarian01a/html/node3.html
http://mlg.eng.cam.ac.uk/porbanz/teaching/sheet_ml__svm.pdf
http://www.ai.mit.edu/courses/6.867-f04/lectures/lecture-7-ho.pdf
http://www.yom-tov.info/Uploads/SVM.m
http://www.cs.cmu.edu/~ggordon/SVMs/svm.m

PLSI:
http://ezcodesample.com/plsaidiots/NMFPLSA.html
http://people.csail.mit.edu/fergus/iccv2005/bagwords.html
http://videolectures.net/mlss09uk_blei_tm/

SVD:
http://web.mit.edu/be.400/www/SVD/Singular_Value_Decomposition.htm

Paper: finding the solution path (author: Hastie)
Multiclass classification
String kernel, diffusion kernel, Fisher kernel?
http://bioinformatics.oxfordjournals.org/content/17/4/349.full.pdf+html
http://en.wikipedia.org/wiki/Positive-definite_matrix

Original (primal) form: we need to calculate w (based on knowing alpha).
Kernel points: (1) no need to calculate w; the decision function is evaluated directly from the alphas (see the sketch below).
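A minimal sketch of the "no need to calculate w" point: in the dual/kernel form the SVM decision function f(x) = sum_i alpha_i * y_i * K(x_i, x) + b is evaluated straight from the alphas and support vectors, so w is never formed explicitly. The RBF kernel choice and the toy numbers below are assumptions for illustration, not values from these notes.

import numpy as np

def rbf_kernel(x, z, gamma=0.5):
    # K(x, z) = exp(-gamma * ||x - z||^2)
    return np.exp(-gamma * np.sum((x - z) ** 2))

def svm_decision(x, support_vectors, alphas, labels, b, kernel=rbf_kernel):
    # Dual-form decision: f(x) = sum_i alpha_i * y_i * K(x_i, x) + b.
    # Only the support vectors and their alphas are needed; w is never built.
    return sum(a * y * kernel(sv, x)
               for a, y, sv in zip(alphas, labels, support_vectors)) + b

# Toy example: two support vectors with made-up alphas and labels.
svs = [np.array([1.0, 1.0]), np.array([-1.0, -1.0])]
alphas, labels, b = [0.8, 0.8], [+1, -1], 0.0
print(svm_decision(np.array([0.9, 1.2]), svs, alphas, labels, b))  # positive -> class +1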

Derive: structural loss function, 2nd slide, some formula.
--------------------
Why sigmoid for Bernoulli? Bernoulli -> tossing a coin (is it the book or not?).
1) For P(y | x; theta):
case 1: y in R, Gaussian noise -> least squares
case 2: y in {0, 1} -> Bernoulli -> logistic regression (maximum likelihood) (see the sketch at the end of this block)
Softmax regression: generalization of logistic regression; with k = 2 it is just logistic regression.
2) Kernel or Mercer kernel --> look into the source code of SVM-light (C = 0.05) (author: Thorsten Joachims) --> libSVM

SVD - Singular Value Decomposition
PCA: objective function with a constraint (the vector is normalized to 1)
Bag-of-words and tf-idf do not link related documents that use different words with the same meaning, so we go for SVD.

(1) maximum likelihood (3) posterior probability
PLSA log-likelihood: L = sum_{d,w} n(d, w) * log( sum_z P(w | z) P(d | z) P(z) )
Why P(w | z)? -> w and d are independent given z, so we connect w and d through z.

HW: Derive the Newton-Raphson method.
What is the Gibbs distribution?
What is a sparse solution in your probability distribution model? L1 can be used for finding a sparse solution.

SVM 1-norm: w is piecewise linear in 1/C. The penalty is larger, so the solution path for the objective function is not smoother. For C = 1, the square (the 1-norm ball) and the circle (formed at the end point of the training-data line plot) meet at a point.
SVM 2-norm: the penalty is higher because of the square, hence the solution path for the objective function is not smoother; the circle is the 2-norm ball (formed at the end point of the training-data line plot) (for C = 1).

Generative model: P(w, d) = integral of P(w, d | theta) p(theta) d(theta)
or discriminative learning: P(y | x)
-----------
Sometimes going from supervised to unsupervised can be done -> Restricted Boltzmann machine (RBM).
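A small sketch of the "sigmoid for Bernoulli" case above: for y in {0, 1}, a Bernoulli likelihood with a sigmoid link gives the logistic-regression log-likelihood, maximized here by plain gradient ascent. The synthetic data, step size, and iteration count are assumptions for illustration.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_likelihood(theta, X, y):
    # Bernoulli log-likelihood: sum_i [ y_i log p_i + (1 - y_i) log(1 - p_i) ], p_i = sigmoid(theta . x_i)
    p = sigmoid(X @ theta)
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# Tiny synthetic problem: one feature plus a bias column (made-up data).
X = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])

theta = np.zeros(2)
for _ in range(500):
    grad = X.T @ (y - sigmoid(X @ theta))  # gradient of the Bernoulli log-likelihood
    theta += 0.1 * grad                    # gradient ascent (maximum likelihood)
print(theta, log_likelihood(theta, X, y))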

An RBM is also a mixture model.

Convexity:
first derivative - gradient
second derivative - Hessian
ML -> descent direction, position -> various ways to find the descent direction; <iteration> <-> convergence
step length <=> minimal step (between 0 and 1); it is favourable.
Second-order necessary condition: at a minimizer the gradient is zero and the Hessian is positive semidefinite (no negative curvature).
Second-order sufficient condition: the Hessian is positive definite (strictly positive, better than just non-negative).
If the objective is non-convex and we still want to find a solution, take the direction as the negative (descent) direction.
Once we have the gradient, substitute it into the objective function; then we get the direction.
Wolfe conditions: condition (1) and condition (2):
(1) is the step length satisfied (sufficient decrease)?
(2) is the line-search (curvature) condition satisfied?
Zoutendijk's theorem: to check whether the gradient method converges (is the gradient smooth enough)?
x~ (x tilde) = x(i+1)

class 1:
Cross-validation to find lambda for regularization: the lambda with the maximum F1 is chosen; lambda (like C in SVM) = 0.05.
P(x, y) = P(y, x) -> joint probability
Bayes' theorem -> the denominator is removed, or treated as a constant, because p(y = 1 | x) and p(y = -1 | x) have the same denominator, hence it cancels; the denominator is not needed for classification.
Expectation -> averaging over the probability; f(x), say, the weight of the ball to draw.
(1) Given the objective function, (2) do gradient descent, or take the 2nd partial derivative and equate it to zero. Newton-Raphson (quadratic) converges much faster than gradient descent (see the sketch after this block).
MAP: maximum a posteriori probability: an approach for maximizing the objective function (from Bayes' rule): (1) consider the prior, (2) only consider the conditional probability (maximum likelihood), (3) the denominator.
Entropy (relative entropy / KL divergence): the difference between two probability distributions.
Multivariate: Gaussian, Dirichlet - to concentrate on.
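Following the Newton-Raphson note above, a minimal sketch comparing Newton's update x <- x - f'(x)/f''(x) with fixed-step gradient descent on a 1-D strictly convex function. The test function f(x) = x^2 + exp(-x), the starting point, and the step length 0.2 are assumptions chosen only to make the quadratic-vs-linear convergence visible.

import math

f   = lambda x: x * x + math.exp(-x)   # strictly convex: f''(x) = 2 + exp(-x) > 0
df  = lambda x: 2 * x - math.exp(-x)   # first derivative
d2f = lambda x: 2 + math.exp(-x)       # second derivative (the 1-D Hessian)

x_newton, x_grad = 2.0, 2.0
for k in range(10):
    x_newton -= df(x_newton) / d2f(x_newton)  # Newton-Raphson step (quadratic convergence)
    x_grad   -= 0.2 * df(x_grad)              # gradient descent with a fixed step length
    print(k, round(x_newton, 6), round(x_grad, 6))

Newton lands on the minimizer (about 0.3517, where 2x = exp(-x)) in roughly three iterations, while the fixed-step gradient iterate is still approaching it; that is the "converges much faster" point in the notes.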

Conjugate priors:
Dirichlet <-> multinomial
Gaussian <-> Gaussian
--> RBM (deep learning) and RNN are similar, since they have similar energy functions.

Beta distribution:
tossing once -> Bernoulli
tossing many times -> binomial
people coming to a class -> Gaussian; most events in life are Gaussian.

Overfitting: penalty (lambda / 2) * ||w||^2: either make w small or make lambda larger.
N(d, w) / N(d): we want to smooth this; the raw ratio may not work all the time, which is the reason for going to a prior (Dirichlet) distribution (sketch below).
The binomial has one parameter; the beta distribution has two parameters.
Conjugate prior: the posterior stays in the same distribution family as the prior.
Laplace distribution -> has a sparsity effect? What is sparse? -> force the vector to have fewer non-zero values.
With a conjugate prior it is easy to solve many objective functions.
PLSI - Dirichlet --> read Gaussian, read topic modeling; how to avoid overfitting?
--------------------------
The prior is there to avoid overfitting, i.e., for smoothing.

Multivariate Gaussian:
----------------------
Covariance matrix: its inverse is called the precision matrix.
The covariance controls the orientation of the Gaussian, and hence the contours.
1) Maximum likelihood
2) Bayesian (we have a prior, and that is the only difference).
http://en.wikipedia.org/wiki/Robbins-Monro_procedure
http://en.wikipedia.org/wiki/Student's_t-distribution
-----
slide 1) compare mappings, provided the feature space is the same.
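A sketch of why a conjugate prior smooths the raw ratio N(d, w) / N(d): with a Beta(a, b) prior on a Bernoulli/binomial parameter, the posterior is again a Beta, and its mean pulls the maximum-likelihood estimate toward the prior (the Dirichlet-multinomial case works the same way per word). The pseudo-counts a = 2, b = 2 and the toy counts are illustrative assumptions.

def posterior_mean(successes, trials, a=2.0, b=2.0):
    # Beta(a, b) prior + binomial likelihood -> Beta(a + successes, b + trials - successes) posterior.
    # The posterior mean is a smoothed version of the raw estimate successes / trials.
    return (a + successes) / (a + b + trials)

# Word w never occurs in a 10-word document d:
print(0 / 10)                 # maximum likelihood: 0.0 ("may not work all the time")
print(posterior_mean(0, 10))  # smoothed: 2 / 14 = 0.142857...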

www.cse.st.hk/TL/index.html
bincao@microsoft.com
------------
Topic: directed/undirected graphical models; inference on topic models
Sparse: no structure (no link information between words), only instance information.
Sigmoid or Gaussian distribution: converts a real value to a value between 0 and 1.
Graph cut: maximize the conditional probability.
http://en.wikipedia.org/wiki/Conditional_random_field
Conditional random field: P(Y | X) = (1/Z) exp( sum_i E(y_i, G_i, X_i) ) (1/Z is the normalizing constant).
Label bias: ?
Factor graph: no direct link between two variables, but there is a function (factor) that links both.
(1) marginal probability and (2) conditional probability are the two concepts used in factor graphs.
Loopy belief propagation -> (no exact inference) nodes just pass messages to each other.
-----------------------------------
Gaussian mean - Gaussian
Gaussian variance - gamma (on the precision; inverse-gamma on the variance)
binomial - beta
multinomial - Dirichlet
http://reality.media.mit.edu/
Normalization is hard: with n (binary) nodes there are 2^n configurations.
http://en.wikipedia.org/wiki/Conditional_random_field
http://en.wikipedia.org/wiki/Viterbi_algorithm
http://en.wikipedia.org/wiki/Forward-backward_algorithm
http://en.wikipedia.org/wiki/Markov_random_field
(a small Viterbi sketch follows this block)
To get benchmark data: https://www.mturk.com/mturk/welcome
Maintain a dictionary to hold keywords like "traffic" and see how it influences the prediction.
GT: red, green, yellow; prediction: street -> % of traffic
p(w, street) = sum_z p(z) p(w | z) p(street | z)
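A minimal Viterbi sketch, the exact chain-structured inference that the CRF/HMM links above point to: dynamic programming keeps the cost linear in the sequence length instead of enumerating all 2^n label sequences. The 4-position, 2-label chain and all emission/transition scores below are made-up numbers, not anything from the notes.

import numpy as np

def viterbi(emit, trans):
    # emit[t, s]: log-score of label s at position t; trans[s, s2]: log-score of moving s -> s2.
    T, S = emit.shape
    best = emit[0].copy()                       # best log-score of any path ending in each label
    back = np.zeros((T, S), dtype=int)          # backpointers
    for t in range(1, T):
        cand = best[:, None] + trans + emit[t]  # cand[prev, cur]
        back[t] = cand.argmax(axis=0)
        best = cand.max(axis=0)
    path = [int(best.argmax())]
    for t in range(T - 1, 0, -1):               # follow backpointers
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# 4 positions, 2 labels (say "traffic" vs "other"), assumed probabilities turned into log-scores.
emit = np.log(np.array([[0.7, 0.3], [0.6, 0.4], [0.2, 0.8], [0.3, 0.7]]))
trans = np.log(np.array([[0.6, 0.4], [0.4, 0.6]]))
print(viterbi(emit, trans))  # [0, 0, 1, 1]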

