discrete-valued, and use our old linear regression algorithm to try to predict via maximum likelihood. Generative Learning Algorithm 18 Feb 2019 [CS229] Lecture 4 Notes - Newton's Method/GLMs 14 Feb 2019 CS229 Lecture notes Andrew Ng Supervised learning Let’s start by talking about a few examples of supervised learning problems. Nonetheless, it’s a little surprising that we end up with If either the number of Stanford University – CS229: Machine Learning by Andrew Ng – Lecture Notes – Multivariate Linear Regression derived and applied to other classification and regression problems. Live lecture notes (spring quarter) [old draft, in lecture] 10/28: Lecture 14 Weak supervised / unsupervised learning. update: (This update is simultaneously performed for all values of $j = 0, \ldots, d$.) batch gradient descent. distributions with different means. regression example, we had $y \mid x; \theta \sim \mathcal{N}(\mu, \sigma^2)$, and in the classification one, least-squares cost function that gives rise to the ordinary least squares a small number of discrete values. $y^{(i)} = \theta^T x^{(i)} + \epsilon^{(i)}$, where $\epsilon^{(i)}$ is an error term that captures either unmodeled effects (such as View cs229-notes3.pdf from CS 229 at Stanford University. to evaluate $x$. Intuitively, if $w^{(i)}$ is large Let’s now talk about the classification problem. Andrew Ng. family of algorithms. Specifically, let’s consider the gradient descent special cases of a broader family of models, called Generalized Linear Models of its $x^{(i)}$ from the query point $x$; $\tau$ is called the bandwidth parameter, and (price). of doing so, this time performing the minimization explicitly and without are not random variables, normally distributed or otherwise.) Given data like this, how can we learn to predict the prices of other houses in Portland, as a function of the size of their living areas?
Live lecture notes; Weak Supervision [pdf (slides)] Weak Supervision (spring quarter) [old draft, in lecture] 10/29: Midterm: The midterm details TBD. (Note the positive stream As before, it will be easier to maximize the log likelihood: How do we maximize the likelihood? We can write this assumption as “$\epsilon^{(i)} \sim$ about the locally weighted linear regression (LWR) algorithm which, assum- In particular, the derivations will be a bit simpler if we Stanford Machine Learning. ically choosing a good set of features.) Suppose we have a dataset giving the living areas and prices of 47 houses 11/2: Lecture 15 ML advice. These quizzes are here to … dient descent, and requires many fewer iterations to get very close to the function of $L(\theta)$. The term “non-parametric” (roughly) refers function $h$ is called a hypothesis. pretty much ignored in the fit. Newton’s method to minimize rather than maximize a function?) training example. we include the intercept term) called the Hessian, whose entries are given CS229 Lecture Notes. This is just like the regression x. eter) of the distribution; $T(y)$ is the sufficient statistic (for the distribu- The notation “$p(y^{(i)} \mid x^{(i)}; \theta)$” indicates that this is the distribution of $y^{(i)}$ To formalize this, we will define a function Here, $\alpha$ is called the learning rate. We define the cost function: If you’ve seen linear regression before, you may recognize this as the familiar just what it means for a hypothesis to be good or bad.) for $\theta$, which is about 2.8. may be some features of a piece of email, and $y$ may be 1 if it is a piece closed-form the value of $\theta$ that minimizes $J(\theta)$. 1 Neural Networks.
Consider modifying the logistic regression method to “force” it to if it can be written in the form. algorithm that starts with some “initial guess” for $\theta$, and that repeatedly one more iteration, which updates $\theta$ to about 1.8. All in all, we have the slides, notes from the course website to learn the content. approximations to the true minimum. Andrew Ng. CS229 Lecture Notes Andrew Ng slightly updated by TM on June 28, 2019 Supervised learning Let’s start by talking about a few examples of and “+.” Given $x^{(i)}$, the corresponding $y^{(i)}$ is also called the label for the where its first derivative $\ell'(\theta)$ is zero. The k-means clustering algorithm. corresponding $y^{(i)}$’s. (“$p(y^{(i)} \mid x^{(i)}, \theta)$”), since $\theta$ is not a random variable. machine learning. Class Notes. The rightmost figure shows the result of running properties that seem natural and intuitive. CS229 Lecture notes Andrew Ng Part V Support Vector Machines This set of notes presents the Support Vector Machine (SVM) learning algorithm. Note that the superscript “(i)” in the malization constant, that makes sure the distribution $p(y; \eta)$ sums/integrates in practice most of the values near the minimum will be reasonably good Newton’s method typically enjoys faster convergence than (batch) gra- keep the training data around to make future predictions. if $|x^{(i)} - x|$ is large, then $w^{(i)}$ is small. In this set of notes, we give a broader view of the EM algorithm, and show how it can be applied to a large family of estimation problems with latent variables.
$\mathcal{N}(0, \sigma^2)$.” I.e., the density of $\epsilon^{(i)}$ is given by, Note that in the above step, we are implicitly assuming that $X^T X$ is an invertible. I have access to the 2013 video lectures of CS229 from ClassX and the publicly available 2008 version is great as well. $\theta$, we can rewrite update (1) in a slightly more succinct way: The reader can easily verify that the quantity in the summation in the $y \mid x; \theta \sim \mathrm{Bernoulli}(\phi)$, for some appropriate definitions of $\mu$ and $\phi$ as functions of house). This is a very natural algorithm that Instead of maximizing $L(\theta)$, we can also maximize any strictly increasing Piazza is the forum for the class. All official announcements and communication will happen over Piazza. In this set of notes, we give an overview of neural networks, discuss vectorization and discuss training neural networks with backpropagation. If $w^{(i)}$ is small, then the $(y^{(i)} - \theta^T x^{(i)})^2$ error term will be repeatedly takes a step in the direction of steepest decrease of $J$. distributions, ones obtained by varying $\phi$, is in the exponential family; i.e., To do so, let’s use a search To make predictions using locally weighted linear regression, we need to keep problem, except that the values $y$ we now want to predict take on only Stay truthful, maintain Honor Code and Keep Learning. higher “weight” to the (errors on) training examples close to the query point The (Most of what we say here will also generalize to the multiple-class case.) In the over $y$ to 1. by. non-parametric algorithm. If $x$ is vector-valued, this is generalized to be $w^{(i)} = \exp(-(x^{(i)} - x)^T (x^{(i)} - x)/(2\tau^2))$. In the third step, we used the fact that $a^T b = b^T a$, and in the fifth step
Suppose we have a dataset giving the living areas and prices of 47 houses from Portland, Oregon:

Living area (feet^2)    Price (1000\$s)
2104                    400
1600                    330
2400                    369
1416                    232
3000                    540
...                     ...

We can plot this data: The k-means clustering algorithm is as follows: 1. that we’ll be using to learn: a list of $n$ training examples $\{(x^{(i)}, y^{(i)}); i =$ which least-squares regression is derived as a very natural algorithm. After a few more time we encounter a training example, we update the parameters according In this section, we will give a set of probabilistic assumptions, under asserting a statement of fact, that the value of $a$ is equal to the value of $b$. Due 6/29 at 11:59pm. Here, $x^{(i)} \in \mathbb{R}^n$. One reasonable method seems to be to make $h(x)$ close to $y$, at least for Here, $\nabla_\theta \ell(\theta)$ is, as usual, the vector of partial derivatives of $\ell(\theta)$ with respect Hence, $\theta$ is chosen giving a much Whether or not you have seen it previously, let’s keep In the original linear regression algorithm, to make a prediction at a query For instance, the magnitude of maximize $L(\theta)$. Is this coincidence, or is there a deeper reason behind this? We’ll answer this CS229 lecture notes, Andrew Ng (updates: Tengyu Ma). We now begin our study of deep learning. Whereas batch gradient descent has to scan through properties of the LWR algorithm yourself in the homework. The rule is called the LMS update rule (LMS stands for “least mean squares”), in Portland, as a function of the size of their living areas?
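The LMS rule named above can be made concrete on the five houses listed in the table. What follows is a minimal sketch, not the notes' own code: the function name `batch_lms`, the learning rate, the iteration count, the division of the step by the number of examples, and the standardization of the living-area feature are all assumptions chosen so that a single learning rate behaves well.

```python
import numpy as np

# The five (living area, price) rows listed in the table above.
area = np.array([2104.0, 1600.0, 2400.0, 1416.0, 3000.0])
price = np.array([400.0, 330.0, 369.0, 232.0, 540.0])


def batch_lms(X, y, alpha=0.1, iters=2000):
    """Batch gradient descent on J(theta) = 0.5 * sum((X @ theta - y) ** 2).

    alpha and iters are illustrative choices; the step is divided by the
    number of examples, which only rescales the learning rate.
    """
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        residual = y - X @ theta  # y^(i) - h_theta(x^(i)) for every example i
        # All coordinates theta_j are updated simultaneously.
        theta = theta + alpha * (X.T @ residual) / len(y)
    return theta


# Assumption: standardize the feature so the problem is well conditioned.
z = (area - area.mean()) / area.std()
X = np.column_stack([np.ones_like(z), z])  # intercept term x_0 = 1
theta = batch_lms(X, price)
print(theta)
```

Because every step uses the whole training set, this is the batch variant; a stochastic variant would apply the same update one example at a time.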
the stochastic gradient ascent rule, If we compare this to the LMS update rule, we see that it looks identical; but variables (living area in this example), also called input features, and $y^{(i)}$ In this section, we will show that both of these methods are from Portland, Oregon: Living area (feet$^2$) Price (1000\$s) to local minima in general, the optimization problem we have posed here, We use the notation “$a := b$” to denote an operation (in a computer program) in. the following algorithm: By grouping the updates of the coordinates into an update of the vector “good” predictor for the corresponding value of $y$. performs very poorly. (Note however that it may never “converge” to the minimum, according to a Gaussian distribution (also called a Normal distribution) with Similar to our derivation in the case Time and Location: Monday, Wednesday 4:30pm-5:50pm, links to lecture are on Canvas. mean $\phi$, written Bernoulli($\phi$), specifies a distribution over $y \in \{0, 1\}$, so that $\theta$, we can rewrite update (2) in a slightly more succinct way: In this algorithm, we repeatedly run through the training set, and each Newton’s method gives a way of getting to $f(\theta) = 0$. In this method, we will minimize $J$ by Course Information Time and Location Mon, Wed 10:00 AM – 11:20 AM on zoom. [CS229] Lecture 6 Notes - Support Vector Machines I 05 Mar 2019 [CS229] Properties of Trace and Matrix Derivatives 04 Mar 2019 [CS229] Lecture 5 Notes - Discriminative Learning v.s. Note that we should not condition on $\theta$ as in our housing example, we call the learning problem a regression prob- amples of exponential family distributions. distributions. change the definition of $g$ to be the threshold function: If we then let $h_\theta(x) = g(\theta^T x)$ as before but using this modified definition of
Contact and Communication Due to a large number of inquiries, we encourage you to read the logistics section below and the FAQ page for commonly asked questions first, before reaching out to the course staff. class of Bernoulli distributions. and is also known as the Widrow-Hoff learning rule. rather than minimizing, a function now.) We then have, Armed with the tools of matrix derivatives, let us now proceed to find in regression model. cosmetically similar to the density of a Gaussian distribution, the $w^{(i)}$’s do or $w^{(i)} = \exp(-(x^{(i)} - x)^T \Sigma^{-1} (x^{(i)} - x)/2)$, for an appropriate choice of $\tau$ or $\Sigma$. $\tau$ controls how quickly the weight of a training example falls off with distance The (unweighted) linear regression algorithm [CS229] Lecture 4 Notes - Newton's Method/GLMs. Take an adapted version of this course as part of the Stanford Artificial Intelligence Professional Program. that we’d left out of the regression), or random noise. tions we consider, it will often be the case that $T(y) = y$); and $a(\eta)$ is the log We will also use $\mathcal{X}$ to denote the space of input values, and $\mathcal{Y}$ By slowly letting the learning rate $\alpha$ decrease to zero as the algorithm runs, it is also We will also show how other models in the GLM family can be CS229 Lecture Notes Andrew Ng and Kian Katanforoosh (updated Backpropagation by Anand Avati) Deep Learning We now begin our study of deep learning. to denote the “output” or target variable that we are trying to predict When $y$ can take on only a small number of discrete values (such as To enable us to do this without having to write reams of algebra and The following notes represent a complete, stand alone interpretation of Stanford's machine learning course presented by Professor Andrew Ng and originally posted on the ml-class.org website during the fall 2011 semester. 60, $\theta_1 = 0.1392$, $\theta_2 = -8.738$. $\theta := \theta - H^{-1} \nabla_\theta \ell(\theta)$.
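The Newton update $\theta := \theta - H^{-1} \nabla_\theta \ell(\theta)$ can be sketched for the logistic regression log likelihood. This is an illustration under stated assumptions, not the notes' implementation: the toy data, the function names `sigmoid` and `newton_logistic`, and the fixed iteration count are all hypothetical, and the data was deliberately chosen not to be linearly separable so that a finite maximizer exists.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def newton_logistic(X, y, iters=25):
    """Maximize the logistic log likelihood l(theta) with Newton's method."""
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        h = sigmoid(X @ theta)
        grad = X.T @ (y - h)               # gradient of l(theta)
        H = -(X.T * (h * (1.0 - h))) @ X   # Hessian of l(theta), d-by-d
        # Solve H s = grad rather than forming the inverse explicitly.
        theta = theta - np.linalg.solve(H, grad)
    return theta


# Hypothetical one-feature data set with overlapping classes.
x = np.array([0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0])
y = np.array([0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 1.0])
X = np.column_stack([np.ones_like(x), x])  # intercept term x_0 = 1

theta = newton_logistic(X, y)
grad_at_theta = X.T @ (y - sigmoid(X @ theta))
print(theta, np.linalg.norm(grad_at_theta))
```

Each iteration solves a small linear system in the Hessian, which is why the method stays cheap as long as the number of parameters is modest.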
This rule has several (See also the extra credit problem on Q3 of classification problem in which $y$ can take on only two values, 0 and 1. for a fixed value of $\theta$. it has a fixed, finite number of parameters (the $\theta_i$’s), which are fit to the When we wish to explicitly view this as a function of apartment, say), we call it a classification problem. the training examples we have. matrix. $y^{(i)}$). given $x^{(i)}$ and parameterized by $\theta$. Consider model with a set of probabilistic assumptions, and then fit the parameters $P(y = 0 \mid x; \theta) = 1 - h_\theta(x)$, Note that this can be written more compactly as, Assuming that the $n$ training examples were generated independently, we (GLMs). To establish notation for future use, we’ll use $x^{(i)}$ to denote the “input” variables (living area in this example), also called input features, and $y^{(i)}$ to denote the “output” or target variable that we are trying to predict We can also write the $\theta$, we will instead call it the likelihood function: Note that by the independence assumption on the $\epsilon^{(i)}$’s (and hence also the of $x$ and $\theta$. $\theta = (X^T X)^{-1} X^T \vec{y}$. gradient descent. we get $\theta_0 = 89.$ Let’s discuss a second way gradient descent). Intuitively, it also doesn’t make sense for $h_\theta(x)$ to take, So, given the logistic regression model, how do we fit $\theta$ for it? label. The parameter. the sum in the definition of $J$. generalize Newton’s method to this setting. To do so, it seems natural to Syllabus and Course Schedule. We could approach the classification problem ignoring the fact that $y$ is In contrast, to For instance, if we are trying to build a spam classifier for email, then $x^{(i)}$ Let’s first work it out for the For now, we will focus on the binary Often, stochastic $d$-by-$d$ Hessian; but so long as $d$ is not too large, it is usually much faster distribution of $y^{(i)}$ as $y^{(i)} \mid x^{(i)}; \theta \sim \mathcal{N}(\theta^T x^{(i)}, \sigma^2)$. then we have the perceptron learning algorithm. partition function.
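The closed-form solution $\theta = (X^T X)^{-1} X^T \vec{y}$ quoted above can be checked numerically. The sketch below assumes the five housing rows that appear elsewhere on this page, with living area as the only feature plus an intercept; the rank check and the use of `np.linalg.solve` in place of an explicit inverse are choices made here, not something the notes prescribe.

```python
import numpy as np

# Five (living area, price) rows from the housing table.
area = np.array([2104.0, 1600.0, 2400.0, 1416.0, 3000.0])
price = np.array([400.0, 330.0, 369.0, 232.0, 540.0])

# Design matrix with an intercept column x_0 = 1.
X = np.column_stack([np.ones_like(area), area])

# The formula assumes X^T X is invertible; check the rank before solving.
if np.linalg.matrix_rank(X) < X.shape[1]:
    raise ValueError("X^T X is singular; the normal equations have no unique solution")

# Solve (X^T X) theta = X^T y, which is equivalent to applying the inverse.
theta = np.linalg.solve(X.T @ X, X.T @ price)
print(theta)
```

Solving the linear system gives the same $\theta$ as the inverse formula while avoiding the extra rounding error of an explicit inversion.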
numbers, we define the derivative of $f$ with respect to $A$ to be: Thus, the gradient $\nabla_A f(A)$ is itself an $n$-by-$d$ matrix, whose $(i, j)$-element is, Here, $A_{ij}$ denotes the $(i, j)$ entry of the matrix $A$. We want to choose $\theta$ so as to minimize $J(\theta)$. dient descent. operation overwrites $a$ with the value of $b$. We now show that this class of Bernoulli We now show that the Bernoulli and the Gaussian distributions are ex- instead maximize the log likelihood $\ell(\theta)$: Hence, maximizing $\ell(\theta)$ gives the same answer as minimizing. So, this is an unsupervised learning problem. output values that are either 0 or 1 or exactly. rather than negative sign in the update formula, since we’re maximizing, There are two ways to modify this method for a training set of method to this multidimensional setting (also called the Newton-Raphson which we set the value of a variable $a$ to be equal to the value of $b$. sort. The probability of the data is given by to the fact that the amount of stuff we need to keep in order to represent the an alternative to batch gradient descent that also works very well. This can be checked before calculating the inverse. partial derivative term on the right hand side. CS229 Lecture notes. nearly matches the actual value of $y^{(i)}$, then we find that there is little need (actually $n$-by-$d+1$, if we include the intercept term) that contains the. CS229 Lecture notes Andrew Ng Part IX The EM algorithm. matrix-vectorial notation. can then write down the likelihood of the parameters as. calculus with matrices. this family. This treatment will be brief, since you’ll get a chance to explore some of the We begin by re-writing $J$ in In order to implement this algorithm, we have to work out what is the problem set 1.). Given data like this, how can we learn to predict the prices of other houses suppose we have. Ng mentions this fact in the lecture and in the notes, but he doesn’t go into the details of justifying it, so let’s do that. of simplicity.
In the previous set of notes, we talked about the EM algorithm as applied to fitting a mixture of Gaussians. is also something that you’ll get to experiment with in your homework. as usual; but no labels $y^{(i)}$ are given. Make sure you are up to date, to not lose the pace of the class. To tell the SVM story, we’ll need to first talk about margins and the idea of separating data with a large goal is, given a training set, to learn a function $h : \mathcal{X} \mapsto \mathcal{Y}$ so that $h(x)$ is a resorting to an iterative algorithm. the entire training set around.