Implicit Generation and Modeling with Energy-Based Models
May 30, 2024
Notes on: Implicit Generation and Modeling with Energy-Based Models
Contributions
- Scalable algorithms for training EBMs
- Show compositionality, image decorruption, and inpainting
- Useful for: out-of-distribution classification, adversarially robust classification, multi-step trajectory prediction, and online learning
Definitions
Energy is represented as a scalar function composed over latent and observable variables
Langevin dynamics is a gradient-based MCMC method
Energy-Based Models Sampling
\(x \in \mathbb{R}^D\), for \(D\) the dimension of the original data; \(E_{\theta}: \mathbb{R}^D \rightarrow \mathbb{R}\) is an energy function with trainable parameters \(\theta\), parametrized as a neural network
\[P_{\theta}(x) = \frac{\exp (- E_{\theta}(x))}{Z(\theta)}, \qquad Z(\theta)=\int \exp(-E_{\theta}(x))\, dx\]\(P_{\theta}\) is hard to sample from directly, so Langevin dynamics is used to sample efficiently.
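Below is a minimal sketch of such an energy network in PyTorch; the `EnergyNet` name, the MLP layers, and the hidden size are illustrative assumptions (the paper uses a convolutional network for images).

```python
import torch
import torch.nn as nn

class EnergyNet(nn.Module):
    """Maps x in R^D to a scalar energy E_theta(x)."""
    def __init__(self, dim: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, 1),  # single scalar energy per sample
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x).squeeze(-1)  # shape: (batch,)
```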
\[\tilde{x}^k= \tilde{x}^{k-1}-\frac{\lambda}{2}\nabla_x E_{\theta}(\tilde{x}^{k-1})+\omega^k, \qquad \omega^k \sim N(0, \lambda)\]As \(k \rightarrow \infty\) and \(\lambda \rightarrow 0\), \(\tilde{x}^k \sim q_\theta\) with \(q_\theta \rightarrow p_\theta\). Where an explicit generator would use a “deconvolution” architecture, here generation is implemented implicitly through the gradient \(\nabla_x E_{\theta}\) of the energy network.
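A sketch of the sampler, assuming the `EnergyNet` above; `steps` plays the role of \(K\) and `step_size` the role of \(\lambda\) (the defaults are placeholders, not the paper's settings).

```python
import torch

def langevin_sample(energy_net, x0: torch.Tensor, steps: int = 60,
                    step_size: float = 0.01) -> torch.Tensor:
    """Run K steps of x <- x - (lambda/2) * grad_x E(x) + N(0, lambda)."""
    x = x0.detach().clone()
    for _ in range(steps):
        x.requires_grad_(True)
        grad = torch.autograd.grad(energy_net(x).sum(), x)[0]  # nabla_x E_theta
        x = (x.detach() - 0.5 * step_size * grad
             + step_size ** 0.5 * torch.randn_like(x))  # noise with variance lambda
    return x.detach()
```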
Maximum likelihood training
\[\min_\theta L_{ML}(\theta)=\langle - \log P_\theta(x) \rangle_{x \sim P_D} \\ -\log P_\theta(x)= E_\theta(x) + \log Z(\theta)\\ \nabla_\theta L_{ML}(\theta) \approx \langle \nabla_\theta E_\theta(x^+) \rangle_{x^+ \sim P_D} - \langle \nabla_\theta E_\theta (x^-) \rangle_{x^- \sim P_\theta}\]The negative-phase term is important: raising the energy of model samples \(x^-\) maintains sample diversity and prevents model collapse. Note also that no gradient flows through the sampling procedure itself.
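This gradient estimator can be written as a loss whose autograd gradient matches the expression above. The sketch below assumes PyTorch and already includes the \(\alpha L_2\) energy-magnitude regularizer from the pseudo-algorithm further down; `ebm_loss` and `alpha` are assumed names.

```python
import torch

def ebm_loss(energy_net, x_pos: torch.Tensor, x_neg: torch.Tensor,
             alpha: float = 1.0) -> torch.Tensor:
    e_pos = energy_net(x_pos)           # positive phase: data samples x+
    e_neg = energy_net(x_neg.detach())  # negative phase: model samples x- (stop-gradient)
    ml_term = e_pos.mean() - e_neg.mean()       # lowers E(x+), raises E(x-)
    l2_term = (e_pos ** 2 + e_neg ** 2).mean()  # keeps energy magnitudes bounded
    return alpha * l2_term + ml_term
```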
Sample Replay Buffer: saves generated samples \(\tilde{x}\) to initialize the Langevin dynamics MCMC chains, with 95% of initializations drawn from the buffer and 5% from uniform noise
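A minimal buffer sketch under that 95%/5% scheme; the class name, capacity, and the uniform range \([-1, 1]\) are assumptions.

```python
import random
import torch

class ReplayBuffer:
    def __init__(self, capacity: int = 10_000):
        self.capacity = capacity
        self.samples: list[torch.Tensor] = []

    def init_chains(self, batch: int, dim: int) -> torch.Tensor:
        """Draw chain initializations: 95% past samples, 5% uniform noise."""
        out = []
        for _ in range(batch):
            if self.samples and random.random() < 0.95:
                out.append(random.choice(self.samples))
            else:
                out.append(torch.rand(dim) * 2 - 1)  # uniform in [-1, 1]
        return torch.stack(out)

    def add(self, xs: torch.Tensor) -> None:
        self.samples.extend(xs.detach().cpu().unbind(0))
        self.samples = self.samples[-self.capacity:]  # keep most recent
```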
Pseudo-Algorithm
Energy training algorithm (a runnable sketch follows the steps below)
- Input: data distribution \(P_D(x)\), step size \(\lambda\), number of steps \(K\)
- \(B \leftarrow \emptyset\) (initialize the replay buffer)
- While not converged:
- sample \(x^+_i \sim P_D\)
- sample \(x^0_i \sim B\) with probability 95%, and \(x^0_i \sim \mathcal{U}\) (uniform noise) with probability 5%
- for \(k=1:K\): (generate samples with Langevin dynamics)
- \[\tilde{x}^k \leftarrow \tilde{x}^{k-1} - \lambda \nabla_x E_\theta (\tilde{x}^{k-1} ) + \omega, \qquad \omega \sim N(0,\sigma)\]
- \(x^-_i=\Omega(\tilde{x}_i^K)\) (\(\Omega\) is a stop-gradient: no gradient flows through the sampler)
- \(\Delta \theta \leftarrow \nabla_\theta \frac{1}{N}\sum_i \left[\alpha\left(E_\theta(x^+_i)^2+E_\theta(x^-_i)^2\right)+ E_\theta(x^+_i)-E_\theta(x^-_i)\right]\), i.e. \(\min_\theta (\alpha L_2 + L_{ML})\) (optimize objective)
- Update \(\theta\) with \(\Delta \theta\) using an optimizer
- \(B \leftarrow B \cup \tilde{x}_i\) (add samples to the replay buffer)
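Putting the pieces together, one pass of the algorithm above as a sketch, assuming the `EnergyNet`, `langevin_sample`, `ebm_loss`, and `ReplayBuffer` sketches from earlier (learning rate and epoch count are placeholders).

```python
import torch

def train(data_loader, dim: int, epochs: int = 10) -> "EnergyNet":
    net = EnergyNet(dim)
    buffer = ReplayBuffer()
    opt = torch.optim.Adam(net.parameters(), lr=1e-4)
    for _ in range(epochs):
        for x_pos in data_loader:                         # x+ ~ P_D
            x0 = buffer.init_chains(x_pos.shape[0], dim)  # 95% buffer / 5% noise
            x_neg = langevin_sample(net, x0)              # x- via Langevin dynamics
            loss = ebm_loss(net, x_pos, x_neg)            # alpha * L2 + L_ML
            opt.zero_grad()
            loss.backward()
            opt.step()                                    # update theta
            buffer.add(x_neg)                             # B <- B ∪ samples
    return net
```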