Description
I have been looking at the implementation of SGD + Momentum in PyTorch and noticed something a bit different from how other packages (and papers) describe it. For the moment, let's focus solely on (classical) momentum and not Nesterov's version.
At the time of writing, the implementation reads:
    if momentum != 0:
        param_state = self.state[p]
        if 'momentum_buffer' not in param_state:
            buf = param_state['momentum_buffer'] = d_p.clone()
        else:
            buf = param_state['momentum_buffer']
            buf.mul_(momentum).add_(1 - dampening, d_p)
        if nesterov:
            d_p = d_p.add(momentum, buf)
        else:
            d_p = buf

    p.data.add_(-group['lr'], d_p)
Mathematically, if we denote the momentum buffer by v and assume that dampening=0, at every iteration the buffer is updated as v = m*v + g and the step is ∆x = lr * v. Notice that the learning rate lr hits the momentum term v as well as the gradient. To me, this is different from what classical momentum is, and it also differs from how other packages implement SGD+M.
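To make the formulation concrete, here is a minimal sketch of that update in plain Python (assuming dampening=0 and no Nesterov; the function name is hypothetical):

    def pytorch_style_step(param, grad, buf, lr, m):
        # Buffer update: v = m*v + g
        buf = m * buf + grad
        # Step: x = x - lr*v, so lr scales the accumulated momentum too
        param = param - lr * buf
        return param, buf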
Let us contrast this with the Sutskever et al. paper and other commonly used packages such as Lasagne, Keras, and Neon.
Sutskever et al.
In the notation from above, the paper updates v as v = m*v - lr * g with the step ∆x = v. So the learning rate lr only hits the gradient; it does not (explicitly) influence the contribution of the momentum term, which is in contrast with PyTorch's implementation.
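For contrast, the same kind of sketch for the Sutskever-style rule (again with a hypothetical function name):

    def sutskever_style_step(param, grad, vel, lr, m):
        # Velocity update: v = m*v - lr*g, so lr only scales the new gradient
        vel = m * vel - lr * grad
        # Step: x = x + v
        param = param + vel
        return param, vel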
Lasagne
Lasagne employs the same rule as the one suggested in Sutskever et al. for momentum.
    for param in params:
        value = param.get_value(borrow=True)
        velocity = theano.shared(np.zeros(value.shape, dtype=value.dtype),
                                 broadcastable=param.broadcastable)
        x = momentum * velocity + updates[param]
        updates[velocity] = x - param
Keras
Same for Keras:
    for p, g, m in zip(params, grads, moments):
        v = self.momentum * m - lr * g  # velocity
        self.updates.append(K.update(m, v))

        if self.nesterov:
            new_p = p + self.momentum * v - lr * g
        else:
            new_p = p + v
Neon
The same holds for Neon:
    velocity[:] = self.momentum_coef * velocity - lrate * grad

    # Nesterov accelerated gradient (NAG) is implemented the same
    # as in torch's "sgd.lua". It's a reformulation of Sutskever's
    # NAG equation found in "On the importance of initialization
    # and momentum in deep learning".
    if self.nesterov:
        param[:] = param + self.momentum_coef * velocity -\
                   lrate * grad
    else:
        param[:] = param + velocity
Is this disparity real, or am I missing something important?
The difference between the two implementations is not insignificant, especially when lr is reduced along the way. If my claim is true, maybe we could update the reference (I'm not sure what that would be) or include the above version in the SGD code (I can take this up if necessary)?
colesbury commented on Mar 25, 2017
For a fixed learning rate, the two formulations are equivalent. The Torch formulation is chosen because the step size is directly proportional to the learning rate. This means that if you decrease the learning rate, the step size decreases immediately, not after some number of iterations, which is generally what you want.
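A toy experiment supports this: on an assumed 1-D quadratic f(x) = x**2 / 2 (so the gradient is x), the two formulations produce identical iterates under a constant learning rate (the buffers differ only by a factor of -lr), but diverge as soon as the learning rate is dropped mid-run. The names and schedule below are illustrative:

    def run(schedule, style, m=0.9, steps=20):
        x, buf = 1.0, 0.0
        for t in range(steps):
            lr, g = schedule(t), x       # grad of f(x) = x**2 / 2 is x
            if style == 'pytorch':
                buf = m * buf + g        # v = m*v + g
                x = x - lr * buf         # step = -lr*v
            else:
                buf = m * buf - lr * g   # v = m*v - lr*g
                x = x + buf              # step = v
        return x

    constant = lambda t: 0.1
    dropped = lambda t: 0.1 if t < 10 else 0.01

    print(run(constant, 'pytorch'), run(constant, 'sutskever'))  # identical
    print(run(dropped, 'pytorch'), run(dropped, 'sutskever'))    # differ after the drop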
(Issue retitled from "Implementation of SGD + Momentum" to "Add a note in the docs about the momentum formulation used in optim")
keskarnitish commented on Mar 25, 2017
I agree. My only concern was that, given that the reference for the method is the Sutskever paper and there is no documentation explaining the difference, the current implementation could be a "gotcha" for folks moving to PyTorch from other frameworks.
soumith commented on Apr 5, 2017
@keskarnitish if you send a PR adding a note to the docs, I am happy to merge.