Autograd refactor #1016
Conversation
torch/autograd/__init__.py (outdated):

```python
    attribute.

Arguments:
    variables (sequence of Variable): outputs of the differentiated function.
```
torch/autograd/__init__.py (outdated):

```python
        accumulated.
    retain_variables (bool, optional): If True, buffers necessary for
        computing the gradients won't be freed after use. It is only
        necessary to specify True if you want to differentiate any subgraph
```
torch/autograd/__init__.py (outdated):

```python
                    else Variable(var, volatile=True)
                    for var in grad_outputs)
    return Variable._execution_engine.run_backward(
        tuple(outputs), tuple(grad_outputs), retain_variables,
```
```python
# TODO: how to do NoGrad in new style
```
```diff
         if repeat == 1:
             continue
         grad_input = sum(grad_input.chunk(repeat, dim))
-        return grad_input
+        return grad_input, None


 class Cumsum(Function):
```
torch/csrc/autograd/python_hook.h (outdated):

```cpp
@@ -11,7 +11,7 @@ struct PyFunctionPreHook : public FunctionPreHook {
  ~PyFunctionPreHook();
  variable_list operator()(const variable_list& grads) override;
```
```cpp
  THPObjectPtr pyInputs = PyTuple_New(inputs.size());
  if (!pyInputs) throw python_error();
  auto num_inputs = inputs.size();
```
```cpp
  // Returning too many results is ok, but only as long as they're all None
  if (num_outputs > num_forward_inputs) {
    bool all_none = true;
    for (int i = num_forward_inputs; i < num_outputs; i++) {
```
torch/csrc/autograd/input_buffer.h (outdated):

```cpp
#pragma once

// The InputBuffer class accumulates a list of Variables for use by a
// function. It implements logic to avoid modifying the passed
```
```cpp
};

auto Clone::apply(const variable_list& inputs) -> variable_list {
  if (inputs.size() != 1) throw std::runtime_error("Clone expects exactly 1 input");
```
torch/csrc/autograd/engine.cpp (outdated):

```cpp
  auto& fn = *task.fn;
  auto inputs = call_pre_hooks(fn, InputBuffer::variables(std::move(task.inputs)));

  auto& function_callbacks = task.base->function_callbacks;
```
Some context for the refactor: see https://gist.github.com/apaszke/a8bc5f167ca4c0f3a830a23e296d3daf
It's been renamed to …
@apaszke if I use the version from http://pytorch.org/ do I get the same version as master? For all three (previous_functions, next_functions and grad_fn) I am getting `'ConvNdBackward' object has no attribute 'x'`.
@rishab96 the version from http://pytorch.org is not the same version as master; it's from the v0.1.12 branch. To get the master branch, you have to compile from source according to the instructions in README.md.
Hi @apaszke, I was using the adagrad branch; the problem seems to be solved in master. Thanks for the replies.
I'm trying to implement WGAN-GP using the … My code is here: …
I'm also trying to implement WGAN with gradient penalty. I've taken a slightly different approach than @HiiYL:

```python
# calculate `x_hat`, where `x` is a batch of real images and `x_z` is a batch of generated images
epsilon = torch.randn(batch_size, 1, 1, 1)
x_hat = epsilon.expand_as(x.data) * x.data + (1.0 - epsilon.expand_as(x_z.data)) * x_z.data
x_hat = autograd.Variable(x_hat).detach()
o = discriminator.forward(x_hat)
gradients = autograd.grad(o, x_hat)
gradient_penalty = autograd.Variable(10.0 * torch.pow(torch.norm(gradients, p=2) - 1.0, 2.0))
gradient_penalty.backward()
```

and …
@apaszke helped me code some minimal examples to compute second-order derivatives. Perhaps helpful to @HiiYL @mjdietzx:

```python
from torch.autograd import Variable, grad
import torch

x = Variable(torch.ones(1), requires_grad=True)
y = x.pow(3)
g = grad(y, x, create_graph=True)
print(g)   # g = 3
g2 = grad(g, x)
print(g2)  # g2 = 6
```

To implement the gradient penalty in WGAN:

```python
import torch
from torch.autograd import Variable, grad

torch.manual_seed(0)
net = torch.nn.Linear(10, 1)
mse = torch.nn.MSELoss()
x = Variable(torch.randn(128, 10), requires_grad=True)
y = Variable(torch.randn(128, 1))

# your normal loss computation goes here
net.zero_grad()
output = net(x)
loss = mse(output, y)
torch.autograd.backward(loss, create_graph=True)
update1 = net.weight.grad.data.clone()

# gradient penalization (effectively, a second-order derivative)
gradient_penalty = (grad(output.mean(), x, create_graph=True)[0].norm() - 1).pow(2)
gradient_penalty.backward()  # this will be added to the grads w.r.t. the loss
update2 = net.weight.grad.data.clone()
print(update1)
print(update2)
print((update1 - update2).norm())
```
@HiiYL the problem is that you're trying to compute the grad of grad of Convolution, but it's not implemented yet. This PR only added the machinery necessary to compute higher order derivatives, but didn't actually add support to the existing autograd functions.
After fixing my first problem based on your suggestions, I also now get the same error.
It is, but we need a while to adapt the codebase (and the conference season doesn't help with that).
Also, we're accepting PRs that adapt existing functions for higher order grads (I think there are 2 like that open at the moment, and a few already merged).
Hi @apaszke, is there any branch or contributor working on adding support for Variable backward in auto-generated THNN functions now?
@gchanan is going to work on it after he wraps up broadcasting, so at the moment there's no one working on it.
@apaszke The new code can only take a second-order derivative of a scalar. Could you please tell me if there is any way to calculate the Hessian matrix now?
@liboyue you can only create a full Hessian matrix with a for-loop, calculating it per scalar.
Thanks @soumith. I tried to do so, but it is too slow. Do you have any plan to implement this function in the future?
Not really. The only thing we could do would be to batch the ops a bit more across scalars, but it won't give you a large speedup. Computing the full Hessian is very expensive, which is why all these tricks for estimating it exist.
@apaszke You're right, I am just trying to use the Hessian. I tried several autodiff packages and PyTorch is the fastest by far, but it is still too slow. Thanks :)
Hi all, the following code can be used to compute the full Hessian in a loop, as mentioned by @soumith. I'm wondering if anyone knows of a trick to compute the diagonal of the Hessian in a single pass? I wasn't able to come up with anything myself. Code:

```python
x = Variable(torch.FloatTensor([1, 2]), requires_grad=True)
y = x[0].pow(2) * x[1]
dx, = grad(y, x, create_graph=True)
print(dx)  # (4, 1)'
dx_dx1, = grad(dx, x, grad_outputs=torch.FloatTensor([1, 0]), retain_graph=True)
dx_dx2, = grad(dx, x, grad_outputs=torch.FloatTensor([0, 1]))
print(dx_dx1)  # (4, 2)'
print(dx_dx2)  # (2, 0)'
```
There's no way to do that in one go. You can only compute Hessian-vector products this way, and there's no vector that will give you the diagonal when you multiply it by a matrix.
Thanks for the explanation, makes sense!
Progress:

The tasks below are left for future PRs. The first two can be easily parallelized and can be done by others too.

- Convert existing function definitions to the new format (wrapping their backward in `@once_differentiable` for now).
- Reimplement `backward` functions using Variables (so they can be differentiated multiple times).
- …

Summary of changes
New function definition format

Note that the old declarations are still supported - most of the core implementations are still not converted.

The new format makes it possible to implement Jacobian-vector products (jvp, L-op) of functions in terms of the jvp's of other functions (a.k.a. grad of grad, Hessian-vector products).

The new declarations look like this:
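A minimal sketch of such a declaration, using the `MultiplyAdd` example referenced in the annotations below (illustrative only, not the PR's exact listing):

```python
from torch.autograd import Function

class MultiplyAdd(Function):

    @staticmethod
    def forward(ctx, x, y, scalar):
        # ctx replaces self: stash the non-Variable argument on it and
        # use save_for_backward for the input tensors
        ctx.scalar = scalar
        ctx.save_for_backward(x, y)
        return x * y + scalar

    @staticmethod
    def backward(ctx, grad_output):
        # grad_output is a Variable; staying in Variable-land keeps the
        # derivative graph intact, so this backward is differentiable too
        x, y = ctx.saved_variables
        # one gradient per forward argument; None for the non-Variable scalar
        return grad_output * y, grad_output * x, None

# usage: out = MultiplyAdd.apply(x, y, 2.0)
```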
Annotations:

- Functions no longer have an `__init__` method. Think of them as pairs of pure functions that are formulas specifying how to compute the function and its jvp (`Dfn * grad_output`).
- `forward` can now accept arguments of arbitrary types (it used to accept only Variables). Any Variables appearing in `args` will be unpacked into Tensors. Arguments are not recursively searched: for example, a list of Variables won't be unpacked into a list of Tensors, and they won't be registered as inputs in the graph. Keyword arguments are not supported (arg ordering is needed to construct the graph).
- `forward` gets a `ctx` as its first argument - this is an object (of unspecified type - not an instance of this class) with an interface identical to `self` in old-style definitions (`save_for_backward`, `mark_non_differentiable`, etc.), and is used to pass information to the `backward` call. For example, this function needs to save a `scalar` argument. Note that you shouldn't assign input or output tensors to it; intermediate buffers are ok, however.
- `grad_output` is now a Variable, and the whole `backward` method needs to be implemented in terms of Variables (they shouldn't be unpacked into tensors, or the derivative graph will be malformed; see the notes on `@once_differentiable` below). `ctx` will be the same object that was passed to `forward`.
- `backward` should return gradients for all arguments given to `forward` (even non-Variable arguments, for which the gradient should be `None`). Unnecessary trailing `None`s are still accepted (useful when `forward` has optional arguments).

For comparison, here's what a legacy definition of `MultiplyAdd` would look like:
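A sketch of such a legacy definition, assuming the old-style API where per-call state lives on `self` (illustrative, not the PR's exact listing):

```python
from torch.autograd import Function

class MultiplyAdd(Function):

    def __init__(self, scalar):
        # old style: state is stored on the Function instance
        super(MultiplyAdd, self).__init__()
        self.scalar = scalar

    def forward(self, x, y):
        self.save_for_backward(x, y)
        return x * y + self.scalar

    def backward(self, grad_output):
        # old style: backward works directly on tensors,
        # so it cannot itself be differentiated
        x, y = self.saved_tensors
        return grad_output * y, grad_output * x

# usage: out = MultiplyAdd(2.0)(x, y)
```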
@once_differentiable

The fact that `backward` now takes Variables might unnecessarily complicate implementations of custom functions that e.g. call into other libs. For that reason, this PR also introduces a `@once_differentiable` decorator that can be used to wrap `backward`. After adding it, `backward` functions will get a tensor `grad_output` and will be expected to return a grad input tensor for each tensor argument given to `forward` (and `None` for all other args).
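A sketch of how the decorator might be applied (assuming it is importable from `torch.autograd.function`):

```python
from torch.autograd import Function
from torch.autograd.function import once_differentiable

class MultiplyAdd(Function):

    @staticmethod
    def forward(ctx, x, y, scalar):
        ctx.scalar = scalar
        ctx.save_for_backward(x, y)
        return x * y + scalar

    @staticmethod
    @once_differentiable
    def backward(ctx, grad_output):
        # thanks to the decorator, grad_output arrives here as a plain
        # tensor, so it can safely be handed to external libraries; the
        # price is that this backward can't be differentiated again
        x, y = ctx.saved_tensors
        return grad_output * y, grad_output * x, None
```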
torch.autograd.backward

- Added `create_graph`. If `True`, the graph of the vjp will be created, allowing the grad computation itself to be differentiated. Defaults to `True` if `grad_variables` contains at least one non-volatile Variable, and `False` otherwise.
- Renamed `retain_variables` to `retain_graph`. The old argument will remain supported until v0.3, but will print deprecation warnings. If unspecified, `retain_graph` defaults to the value of `create_graph`.
- If `grad_variables` contains tensors, they are automatically promoted to Variables (volatile unless `create_graph` is `True`). Also, `None` entries in `grad_variables` are now accepted if their corresponding `variables` entries are scalar Variables (a grad_output filled with 1 is allocated for them). Additionally, if all `grad_variables` would be `None`, the argument is now optional.
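A small sketch of the new arguments in use (illustrative only):

```python
import torch
from torch.autograd import Variable

x = Variable(torch.randn(3), requires_grad=True)
loss = (x * x).sum()

# grad_variables is omitted (allowed now, since loss is a scalar);
# create_graph=True keeps the vjp graph so the grads can be
# differentiated again
torch.autograd.backward(loss, create_graph=True)
print(x.grad)  # 2 * x
```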
torch.autograd.grad

While the Chainer-style API is great for first-order grads, it doesn't work nearly as well when computing higher-order derivatives. For that reason, this PR also implements `grad` - a functional-style function that computes the vjp and, instead of accumulating it into the `.grad` of all leaves, returns a list of grads w.r.t. the given function inputs (parameters are considered inputs too).

Example:
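A minimal sketch of `grad` in use (illustrative, not the PR's original example):

```python
import torch
from torch.autograd import Variable, grad

x = Variable(torch.ones(2), requires_grad=True)
y = (x * x).sum()

# the grads are returned instead of being accumulated into x.grad
dy_dx, = grad(y, x, create_graph=True)
print(dy_dx)   # 2 * x, i.e. [2, 2]
print(x.grad)  # None - by default grad has no side effects
```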
Arguments:

- The `outputs`, `inputs`, and `grad_outputs` arguments can be either sequences of Variables (or of Tensors and `None`s in the case of `grad_outputs`) or single Variables.
- If one doesn't request the grad w.r.t. all leaf Variables, unneeded gradients are not computed and won't be accumulated into them (by default `grad` has no side effects). If the `only_inputs` argument is set to `False`, the whole graph will be differentiated: grads w.r.t. `inputs` will be returned in a list and not accumulated into `.grad`, while grads w.r.t. all other leaves will be accumulated into their `.grad`.

`.grad` semantics

By default the semantics are the same as right now. When not using any of the options implemented in this PR, `.grad` Variables will be volatile, and incoming grads will be accumulated in-place (both the Variable and its `.data` will be the same objects - while we don't guarantee that, some people depend on it in their scripts, so it's best to support it unless there's no other way).

However, when using derivative graphs, these Variables will need to have their `.grad_fn` set correctly, and shouldn't be modified in-place (they might have been used in some functions!). For that reason, in such cases the `.grad` attribute will point to a new Variable, with new `.data`, after each accumulation.

To sum up:

- Plain backward passes keep accumulating in-place into the existing `.grad`, which remains volatile.
- When a derivative graph is created, `.grad` points to a new Variable (with new `.data`) after each accumulation, so that its `.grad_fn` stays correct.
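A sketch of the resulting behavior, assuming `Variable.backward` forwards `create_graph` (illustrative, based on the description above):

```python
import torch
from torch.autograd import Variable

x = Variable(torch.ones(2), requires_grad=True)

(x * x).sum().backward()
g = x.grad           # volatile Variable, reused across accumulations

(x * x).sum().backward()
print(x.grad is g)   # True - accumulated in-place into the same object

(x * x).sum().backward(create_graph=True)
print(x.grad is g)   # False - .grad now points to a new Variable
```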
Implementation details

If a function's gradient isn't needed, its `apply` function won't be called (all its gradients will default to null, which is the equivalent of 0), and its `next_functions` won't be added to the ready queue (unless they are already waiting for execution and this was their last dependency).