  
^ Convergence rates of stochastic optimization algorithms for convex and nonconvex optimization ^^^^^
^ Algorithms ^ Convex Optimization ^^ Nonconvex Optimization ^^
|            | Constant learning rate | Diminishing learning rate | Constant learning rate | Diminishing learning rate |
| SGD \cite{sca2020} | $\displaystyle{\mathcal{O}\left( \frac{1}{T} \right) + C}$ | $\displaystyle{\mathcal{O}\left( \frac{1}{\sqrt{T}} \right)}$ | $\displaystyle{\mathcal{O}\left( \frac{1}{n} \right) + C}$ | $\displaystyle{\mathcal{O}\left( \frac{1}{\sqrt{n}} \right)}$ |
| SGD with SPS \cite{loizou2021} | --------- | $\displaystyle{\mathcal{O}\left( \frac{1}{T} \right) + C}$ | --------- | $\displaystyle{\mathcal{O}\left( \frac{1}{n} \right) + C}$ |
| Minibatch SGD \cite{chen2020} | --------- | $\displaystyle{\mathcal{O}\left( \frac{1}{T} \right) + C}$ | --------- | $\displaystyle{\mathcal{O}\left( \frac{1}{n} \right) + C}$ |
| Adam \cite{adam} | --------- | $\displaystyle{\mathcal{O}\left( \frac{1}{\sqrt{T}} \right)}^{(*)}$ | --------- | --------- |
| AMSGrad \cite{reddi2018} | --------- | $\displaystyle{\mathcal{O}\left( \sqrt{\frac{1 + \ln T}{T}} \right)}$ | --------- | --------- |
| GWDC \cite{liang2020} | --------- | $\displaystyle{\mathcal{O}\left( \frac{1}{\sqrt{T}} \right)}$ | --------- | --------- |
| AMSGWDC \cite{liang2020} | --------- | $\displaystyle{\mathcal{O}\left( \frac{1}{\sqrt{T}} \right)}$ | --------- | --------- |
| AMSGrad \cite{chen2019} | --------- | $\displaystyle{\mathcal{O}\left( \frac{\ln T}{\sqrt{T}} \right)}$ | --------- | $\displaystyle{\mathcal{O}\left( \frac{\ln n}{\sqrt{n}} \right)}$ |
| AdaBelief \cite{adab} | --------- | $\displaystyle{\mathcal{O}\left( \frac{\ln T}{\sqrt{T}} \right)}$ | --------- | $\displaystyle{\mathcal{O}\left( \frac{\ln n}{\sqrt{n}} \right)}$ |
| Proposed | $\displaystyle{\mathcal{O}\left( \frac{1}{T} \right)} + C_1 \alpha + C_2 \beta$ | $\displaystyle{\mathcal{O}\left( \frac{1}{\sqrt{T}} \right)}$ | $\displaystyle{\mathcal{O}\left( \frac{1}{n} \right)} + C_1 \alpha + C_2 \beta$ | $\displaystyle{\mathcal{O}\left( \frac{1}{\sqrt{n}} \right)}$ |
  
Note: $C$, $C_1$, and $C_2$ are constants independent of the learning rates $\alpha$ and $\beta$, the number of training examples $T$, and the number of iterations $n$. The convergence rate for convex optimization is measured in terms of regret as $R(T)/T$, and the convergence rate for nonconvex optimization is measured as the expectation of the squared gradient norm, $\min_{k\in [n]} \mathbb{E}[\|\nabla f(\bm{x}_k)\|^2]$. With constant learning rates, SGD \cite{sca2020} and the proposed algorithm can be applied not only to convex but also to nonconvex optimization. With diminishing learning rates, SGD \cite{sca2020} and the proposed algorithm have the best convergence rate, $\mathcal{O}(1/\sqrt{n})$. (*) Theorem 1 in \cite{reddi2018} shows that a counterexample to the convergence results in \cite{adam} exists.
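
For reference, the regret $R(T)$ above is used without a definition on this page. A standard definition, assuming online loss functions $f_t$ and iterates $\bm{x}_t$ (an assumption about the setting, not a detail stated here), is

$$ R(T) = \sum_{t=1}^{T} f_t(\bm{x}_t) - \min_{\bm{x}} \sum_{t=1}^{T} f_t(\bm{x}). $$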
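
Below is a minimal numerical sketch of the two measures in the note, using plain SGD (not the proposed algorithm) on an illustrative convex quadratic. The objective, noise level, learning rates, and horizon are arbitrary choices for this example, not settings from the cited papers.

<code python>
# Minimal sketch (not the proposed algorithm): plain SGD on a convex quadratic,
# comparing a constant learning rate with a diminishing one and reporting the two
# measures from the note: average regret R(T)/T and min_k ||grad f(x_k)||^2.
# The quadratic objective, noise level, learning rates, and T are illustrative only.
import numpy as np

rng = np.random.default_rng(0)
dim, T = 10, 5000

A = rng.standard_normal((dim, dim))
H = A.T @ A / dim + np.eye(dim)      # positive definite, so f is convex
x_star = rng.standard_normal(dim)    # minimizer of f

def f(x):
    d = x - x_star
    return 0.5 * d @ H @ d           # f(x_star) = 0

def grad(x):
    return H @ (x - x_star)

def stochastic_grad(x):
    # unbiased estimate: true gradient plus zero-mean Gaussian noise
    return grad(x) + 0.1 * rng.standard_normal(dim)

def run_sgd(lr):
    """Run SGD for T steps; lr maps the step index t >= 1 to a learning rate."""
    x = np.zeros(dim)
    losses, sq_grad_norms = [], []
    for t in range(1, T + 1):
        losses.append(f(x))
        sq_grad_norms.append(np.linalg.norm(grad(x)) ** 2)
        x = x - lr(t) * stochastic_grad(x)
    avg_regret = np.mean(losses) - f(x_star)   # R(T)/T against the fixed minimizer
    return avg_regret, min(sq_grad_norms)

for name, lr in [("constant alpha = 0.05", lambda t: 0.05),
                 ("diminishing alpha_t = 0.1/sqrt(t)", lambda t: 0.1 / np.sqrt(t))]:
    avg_regret, min_sq_grad = run_sgd(lr)
    print(f"{name}: R(T)/T = {avg_regret:.5f}, min_k ||grad||^2 = {min_sq_grad:.6f}")
</code>

Qualitatively, the constant-rate run levels off at a noise-dependent floor while the diminishing-rate run keeps improving as $T$ grows, which is the behavior the $\mathcal{O}(1/T) + C$ and $\mathcal{O}(1/\sqrt{T})$ entries in the table describe.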
  
  