Posts

Estimation

## Terminology

- $\mu$: population mean (usually unknown)
- $\bar x$: sample mean
- $\hat \mu$: estimator of the mean
- $\hat \sigma^2$: estimator of the variance
- $\theta_*$: true parameter (unknown)
- Parameter bias
- Parameter variance

## MAP estimation

We think of $\theta$ as a random variable. By Bayes' rule (dropping the constant $p(d)$),

$$
\begin{align*}
\theta_{MAP} &= \arg\max_\theta p(\theta \mid d) \\
&= \arg\max_\theta p(d \mid \theta)\, p(\theta) \\
&= \arg\max_\theta \log p(d \mid \theta) + \log p(\theta)
\end{align*}
$$

Let's assume that our likelihood and prior are Gaussian:

$$
p(x \mid \mu) = G_x[\mu, 1], \quad p(\mu) = G_\mu[0, \alpha^2]
$$

## ML estimation

We think of $\theta$ as a parameter instead of a random variable. The log-likelihood of $p(x \mid \theta) = G(\mu, \Sigma)$ is given (up to an additive constant) by

$$
L = \sum_{i=1}^m \left\{ \frac 12 \log\left(\det\left[\Sigma^{-1}\right]\right) - \frac 12 (x^{(i)} - \mu)^T \Sigma^{-1} (x^{(i)} - \mu) \right\}
$$

Setting the derivative w.r.t. $\mu$ to zero gives

$$
\hat\mu = \frac 1m \sum_{i=1}^m x^{(i)}
$$

Setting the derivative w.r.t. $\Sigma$ to zero gives

$$
\hat\Sigma = \frac 1m \sum_{i=1}^m (x^{(i)} - \hat\mu)(x^{(i)} - \hat\mu)^T
$$

…
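For a quick numerical check, here is a minimal NumPy sketch (the synthetic data, seed, and $\alpha^2 = 1$ are illustrative assumptions, not part of the post) contrasting the ML estimate $\bar x$ with the MAP estimate for the Gaussian likelihood $G_x[\mu, 1]$ and prior $G_\mu[0, \alpha^2]$ above. For this conjugate pair the MAP maximizer has the closed form $\sum_i x^{(i)} / (m + 1/\alpha^2)$, i.e. the sample mean shrunk toward the prior mean $0$.

```python
import numpy as np

# Illustrative sketch: ML vs. MAP for the Gaussian model above,
# likelihood p(x | mu) = G_x[mu, 1], prior p(mu) = G_mu[0, alpha^2].
# Data, seed, and alpha^2 are made-up values for demonstration.
rng = np.random.default_rng(0)
true_mu = 2.0
m = 20
x = rng.normal(true_mu, 1.0, size=m)   # observed data d = {x^(1), ..., x^(m)}

# ML estimate of the mean: maximize log p(d | mu)  ->  sample mean.
mu_ml = x.mean()

# (Biased) ML estimate of the variance: divides by m, not m - 1.
sigma2_ml = ((x - mu_ml) ** 2).mean()

# MAP estimate: maximize log p(d | mu) + log p(mu).
# For this conjugate Gaussian pair the maximizer is
#   mu_MAP = sum(x) / (m + 1 / alpha^2),
# the sample mean pulled toward the prior mean 0.
alpha2 = 1.0
mu_map = x.sum() / (m + 1.0 / alpha2)

print(f"ML  mean estimate: {mu_ml:.3f}")
print(f"ML  var  estimate: {sigma2_ml:.3f}")
print(f"MAP mean estimate: {mu_map:.3f} (pulled toward 0 by the prior)")
```

As $\alpha^2 \to \infty$ the prior becomes flat and the MAP estimate approaches the ML estimate.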

Convergence rate O(1/k) of gradient descent

Contents:

- Preliminary
  - $k$-strongly convex
  - Lipschitz w.r.t. norm $\|x\|$
- Gradient descent
  - choose step size $t_k$
  - convergence rate under Lipschitz gradient
  - convergence rate under strong convexity
- Steepest descent

## Preliminary

### $k$-strongly convex

$\|x\|^2$ is 2-strongly convex. Roughly speaking, the faster a function grows (in either direction), the stronger convexity it has.

### Lipschitz w.r.t. norm $\|x\|$

A function which satisfies the Lipschitz condition cannot grow too fast; precisely, no faster than linearly. A function cannot be both Lipschitz and strongly convex.

http://www.cs.cmu.edu/~ggordon/10725-F12/schedule.html

Roughly speaking, if $f$ is Lipschitz, its gradient is bounded; if $\nabla f$ is Lipschitz, its Hessian is bounded.

## Gradient descent

Consider the expansion (linear approximation plus proximity term)

$$
x^+ = \arg\min_z \; f(x) + \nabla f(x)^T (z - x) + \frac{1}{2t} \|z - x\|^2 ,
$$

whose minimizer over $z$ is the gradient descent update $x^+ = x - t\,\nabla f(x)$.

### choose step size $t_k$

Backtracking line search: fix $\beta$ with $0 < \beta < 1$; while $f(x - t\nabla f(x)) > f(x) - \frac t2 \|\nabla f(x)\|^2$, update $t \leftarrow \beta t$.

Exact line search: usually not possible, and it's not worth it.

### convergence rate under Lipschitz gradient

Assume that $f$ is convex and differentiable, …
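To make the pieces above concrete, here is a small sketch (not from the post; the quadratic objective, dimensions, and constants are assumptions for illustration) of gradient descent with backtracking line search on a convex quadratic, whose gradient is Lipschitz with constant $L = \lambda_{\max}(A)$. Printing the gap $f(x_k) - f^\star$ shows it shrinking as $k$ grows.

```python
import numpy as np

# Illustrative sketch: gradient descent with backtracking line search on
# f(x) = 0.5 x^T A x - b^T x, a convex function with Lipschitz gradient
# (L = lambda_max(A)). All problem data below are made up for demonstration.
rng = np.random.default_rng(0)
n = 5
M = rng.normal(size=(n, n))
A = M @ M.T + np.eye(n)          # symmetric positive definite
b = rng.normal(size=n)

f = lambda x: 0.5 * x @ A @ x - b @ x
grad = lambda x: A @ x - b
x_star = np.linalg.solve(A, b)   # exact minimizer, only used to report the gap
f_star = f(x_star)

x = np.zeros(n)
beta = 0.5                       # backtracking shrink factor, 0 < beta < 1
for k in range(1, 51):
    g = grad(x)
    t = 1.0
    # Backtracking: shrink t until the sufficient-decrease condition holds,
    # i.e. f(x - t g) <= f(x) - (t/2) ||g||^2.
    while f(x - t * g) > f(x) - 0.5 * t * (g @ g):
        t *= beta
    x = x - t * g
    if k % 10 == 0:
        print(f"k = {k:3d}   f(x_k) - f* = {f(x) - f_star:.3e}")
```

Because a positive-definite quadratic is also strongly convex, the gap here actually decays geometrically, i.e. faster than the generic $O(1/k)$ bound in the title.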