\begin{table*}
\caption{Convergence rates of stochastic optimization algorithms for convex and nonconvex optimization}
\centering
\begin{tabular}{lcccc}
\hline
 & \multicolumn{2}{c}{Convex Optimization} & \multicolumn{2}{c}{Nonconvex Optimization} \\
\cline{2-5}
Algorithms & Constant learning rate & Diminishing learning rate & Constant learning rate & Diminishing learning rate \\
\hline
SGD \cite{sca2020} & $\displaystyle\mathcal{O}\left( \frac{1}{T} \right) + C$ & $\displaystyle\mathcal{O}\left( \frac{1}{\sqrt{T}} \right)$ & $\displaystyle\mathcal{O}\left( \frac{1}{n} \right) + C$ & $\displaystyle\mathcal{O}\left( \frac{1}{\sqrt{n}} \right)$ \\
SGD with SPS \cite{loizou2021} & --- & $\displaystyle\mathcal{O}\left( \frac{1}{T} \right) + C$ & --- & $\displaystyle\mathcal{O}\left( \frac{1}{n} \right) + C$ \\
Minibatch SGD \cite{chen2020} & --- & $\displaystyle\mathcal{O}\left( \frac{1}{T} \right) + C$ & --- & $\displaystyle\mathcal{O}\left( \frac{1}{n} \right) + C$ \\
Adam \cite{adam} & --- & $\displaystyle\mathcal{O}\left( \frac{1}{\sqrt{T}} \right)^{(*)}$ & --- & --- \\
AMSGrad \cite{reddi2018} & --- & $\displaystyle\mathcal{O}\left( \sqrt{\frac{1 + \ln T}{T}} \right)$ & --- & --- \\
GWDC \cite{liang2020} & --- & $\displaystyle\mathcal{O}\left( \frac{1}{\sqrt{T}} \right)$ & --- & --- \\
AMSGWDC \cite{liang2020} & --- & $\displaystyle\mathcal{O}\left( \frac{1}{\sqrt{T}} \right)$ & --- & --- \\
AMSGrad \cite{chen2019} & --- & $\displaystyle\mathcal{O}\left( \frac{\ln T}{\sqrt{T}} \right)$ & --- & $\displaystyle\mathcal{O}\left( \frac{\ln n}{\sqrt{n}} \right)$ \\
AdaBelief \cite{adab} & --- & $\displaystyle\mathcal{O}\left( \frac{\ln T}{\sqrt{T}} \right)$ & --- & $\displaystyle\mathcal{O}\left( \frac{\ln n}{\sqrt{n}} \right)$ \\
Proposed & $\displaystyle\mathcal{O}\left( \frac{1}{T} \right) + C_1 \alpha + C_2 \beta$ & $\displaystyle\mathcal{O}\left( \frac{1}{\sqrt{T}} \right)$ & $\displaystyle\mathcal{O}\left( \frac{1}{n} \right) + C_1 \alpha + C_2 \beta$ & $\displaystyle\mathcal{O}\left( \frac{1}{\sqrt{n}} \right)$ \\
\hline
\end{tabular}
Note: $C$, $C_1$, and $C_2$ are constants independent of the learning rates $\alpha$ and $\beta$, the number of training examples $T$, and the number of iterations $n$. The convergence rate for convex optimization is measured in terms of the average regret $R(T)/T$, and the convergence rate for nonconvex optimization is measured by the minimum expected squared gradient norm $\min_{k\in [n]} \mathbb{E}[\|\nabla f(\bm{x}_k)\|^2]$. When constant learning rates are used, SGD \cite{sca2020} and the proposed algorithm can be applied not only to convex but also to nonconvex optimization. When diminishing learning rates are used, SGD \cite{sca2020} and Algorithm \ref{algo:1} achieve the best convergence rate, $\mathcal{O}(1/\sqrt{n})$. (*) Theorem 1 in \cite{reddi2018} shows that there exists a counter-example to the convergence results in \cite{adam}.
\end{table*}
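For concreteness, the two performance measures referenced in the note can be written out. The display below is a minimal sketch: the regret definition assumes the standard online convex optimization setting, where $f_t$ is the loss revealed at step $t$ and $\bm{x}^\star$ is a fixed comparator; these symbols are assumed notation and do not appear in the table. The rates shown are those listed in the table for the diminishing-learning-rate case of SGD \cite{sca2020} and the proposed algorithm.
% Convex case: average regret over T steps (f_t and \bm{x}^\star are assumed notation).
% Nonconvex case: best expected squared gradient norm over n iterations.
\begin{gather*}
  \frac{R(T)}{T} = \frac{1}{T} \sum_{t=1}^{T} \bigl( f_t(\bm{x}_t) - f_t(\bm{x}^\star) \bigr)
  = \mathcal{O}\!\left( \frac{1}{\sqrt{T}} \right), \\
  \min_{k \in [n]} \mathbb{E}\bigl[ \| \nabla f(\bm{x}_k) \|^2 \bigr]
  = \mathcal{O}\!\left( \frac{1}{\sqrt{n}} \right).
\end{gather*}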