pytorch - connection between loss.backward() and optimizer.step()

Question

Where is an explicit connection between the optimizer and the loss?

How does the optimizer know where to get the gradients of the loss without a call like this: optimizer.step(loss)?

-More context-

When I minimize the loss, I didn't have to pass the gradients to the optimizer.

loss.backward() # Back Propagation
optimizer.step() # Gradient Descent

Shai · Accepted Answer · 2021-02-08 18:25:20Z

141

Without delving too deep into the internals of pytorch, I can offer a simplistic answer:

Recall that when initializing the optimizer you explicitly tell it which parameters (tensors) of the model it should update. The gradients are "stored" by the tensors themselves (they have grad and requires_grad attributes) once you call backward() on the loss. After the gradients have been computed for all tensors in the model, calling optimizer.step() makes the optimizer iterate over all the parameters (tensors) it is supposed to update and use their internally stored grad to update their values.
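A minimal sketch of this, assuming a toy nn.Linear model and random data (both are illustrative choices, not taken from the answer): the optimizer's param_groups hold the very same tensor objects that model.parameters() yields, and backward() is what fills their .grad.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Toy model and data, chosen only for illustration.
model = nn.Linear(3, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# The optimizer stores references to the same tensor objects, not copies.
opt_params = optimizer.param_groups[0]["params"]
same_objects = all(p is q for p, q in zip(model.parameters(), opt_params))

# Before backward(): no gradients are stored yet.
grads_before = [p.grad for p in model.parameters()]

inputs = torch.randn(4, 3)
targets = torch.randn(4, 1)
loss = nn.functional.mse_loss(model(inputs), targets)
loss.backward()

# After backward(): every parameter carries its own gradient.
grads_after = [p.grad for p in model.parameters()]

weights_before = model.weight.detach().clone()
optimizer.step()  # reads p.grad from each referenced parameter
weights_changed = not torch.equal(weights_before, model.weight.detach())

print(same_objects, weights_changed)
```

Nothing is passed from loss to optimizer here; the parameters themselves are the shared state.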

More info on computational graphs and the additional "grad" information stored in pytorch tensors can be found in this answer.

The optimizer holding references to the parameters can sometimes cause trouble, e.g., when the model is moved to GPU after initializing the optimizer. Make sure you are done setting up your model before constructing the optimizer. See this answer for more details.

edited Feb 8, 2021 at 18:25

answered Dec 30, 2018 at 6:39


26

@Aerin it's not a trivial connection... One would have expected optimizer.step to get loss.backward() as an argument. However, it all happens "behind the curtain"...
– Shai
Dec 30, 2018 at 6:52
20

So how does optimizer.step() get the gradient value from loss.backward()? It seems that this answer hasn't explained the mechanism for the "connection". The optimizer has a reference to the model parameters, but the loss function is completely on its own. It doesn't look like it has a reference to the model or the optimizer.
– mofury
Mar 29, 2020 at 21:53
5

@cfeng the loss function is not on its own at all! It is the final node in a single gigantic computational graph which starts with the model inputs and contains all model parameters. This graph is built for each batch and results in a single scalar number on each batch. When we do loss.backward() the process of backpropagation starts at the loss and goes through all of its parents all the way to the model inputs. All nodes in the graph contain a reference to their parent.
– pseudomarvin
Aug 29, 2020 at 20:12
4

@mofury The question isn't that simple to answer in short. Roughly speaking, an instance of a loss function class, say nn.CrossEntropyLoss, can be called and returns a Tensor. The important part is that this Tensor object has a grad_fn attribute, which references the operation that produced it and, through it, the tensors it was derived from. Those tensors have a grad_fn as well, so the backward function can backpropagate through this chain and eventually arrive at the parameters we want to optimize in the model. You can refer to this: pytorch.org/tutorials/beginner/basics/autogradqs_tutorial.html.
– C.K.
Aug 2, 2021 at 13:30
2

@MonaJalal where did you see an optimizer in a validation code? Optimizer has nothing to do with validation/test
– Shai
Mar 9, 2022 at 19:13


Morteza Jalambadani · Accepted Answer · 2019-02-27 14:32:32Z

58

When you call loss.backward(), all it does is compute the gradient of the loss w.r.t. all the parameters involved in computing the loss that have requires_grad = True, and store each gradient in the parameter.grad attribute of the corresponding parameter.

optimizer.step() updates all the parameters based on parameter.grad
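To make the second half concrete, here is a small sketch comparing a hand-written update with optimizer.step() for plain SGD (no momentum or weight decay). The tensors w_manual and w_optim and the quadratic loss are illustrative assumptions.

```python
import torch

lr = 0.1

# Two copies of the same illustrative parameter.
w_manual = torch.tensor([1.0, -2.0], requires_grad=True)
w_optim = torch.tensor([1.0, -2.0], requires_grad=True)
optimizer = torch.optim.SGD([w_optim], lr=lr)

# Same loss on both copies; backward() stores d(loss)/dw in .grad.
(w_manual ** 2).sum().backward()
(w_optim ** 2).sum().backward()

# Hand-written update: read .grad and change the values in place.
with torch.no_grad():
    w_manual -= lr * w_manual.grad

# The optimizer does the equivalent for each parameter it was given.
optimizer.step()

print(w_manual, w_optim)
```

Both tensors end up with the same values, which is the sense in which step() only needs parameter.grad.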

edited Feb 27, 2019 at 14:32


answered Feb 27, 2019 at 13:26

Ganesh


6

loss is computing the loss between two tensors with no relation to a network. How does loss.backward() know which network it needs to reference and compute parameter.grad for?
– Aziz Alfoudari
May 21, 2020 at 7:53
@AzizAlfoudari see my answer for a clarification attempt :).
– pseudomarvin
Aug 29, 2020 at 20:14


pseudomarvin · Accepted Answer · 2022-09-09 07:45:35Z

Perhaps this will clarify a little the connection between loss.backward and optim.step (although the other answers are to the point).

import torch

# Our "model"
x = torch.tensor([1., 2.], requires_grad=True)
y = 100*x

# Compute loss
loss = y.sum()

# Compute gradients of the loss w.r.t. the parameters
print(x.grad)     # None
loss.backward()
print(x.grad)     # tensor([100., 100.])

# Modify the parameters by subtracting the gradient
optim = torch.optim.SGD([x], lr=0.001)
print(x)        # tensor([1., 2.], requires_grad=True)
optim.step()
print(x)        # tensor([0.9000, 1.9000], requires_grad=True)

loss.backward() sets the grad attribute of all leaf tensors with requires_grad=True in the computational graph that ends at loss (only x in this case; the intermediate tensor y does not keep its gradient).

The optimizer just iterates through the list of parameters (tensors) it received on initialization, and wherever a tensor has a gradient stored in its .grad property, it subtracts that gradient (simply multiplied by the learning rate in the case of SGD). It doesn't need to know with respect to what loss the gradients were computed; it just wants to access that .grad property so it can do x = x - lr * x.grad
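The same division of labour can be sketched without PyTorch at all. Param, TinySGD and backward_for_sum_of_100x below are hypothetical stand-ins written for this illustration, not PyTorch classes; they mimic the example above, where the loss is the sum of 100*x.

```python
class Param:
    """Stand-in for a tensor: a value plus a slot for its gradient."""

    def __init__(self, value):
        self.value = value
        self.grad = None


class TinySGD:
    """Hypothetical optimizer that only keeps references to parameters."""

    def __init__(self, params, lr):
        self.params = list(params)
        self.lr = lr

    def step(self):
        for p in self.params:
            if p.grad is None:  # nothing was computed for this parameter
                continue
            p.value -= self.lr * p.grad

    def zero_grad(self):
        for p in self.params:
            p.grad = None


def backward_for_sum_of_100x(params):
    """Plays the role of loss.backward() for loss = sum(100 * x)."""
    for p in params:
        p.grad = 100.0


x = [Param(1.0), Param(2.0)]
optim = TinySGD(x, lr=0.001)

backward_for_sum_of_100x(x)  # writes .grad, knows nothing about optim
optim.step()                 # reads .grad, knows nothing about the loss

print([p.value for p in x])
```

The two functions never call each other; they communicate only through the grad slot on the shared Param objects.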

Note that if we were doing this in a training loop, we would call optim.zero_grad() before each backward pass, because in each training step we want to compute new gradients; we don't care about gradients from the previous batch. Not zeroing grads would lead to gradient accumulation across batches.
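A short sketch of that accumulation, reusing the same toy tensor x as above (the variable names accumulated and fresh are only for illustration):

```python
import torch

x = torch.tensor([1.0, 2.0], requires_grad=True)
optim = torch.optim.SGD([x], lr=0.001)

# Two backward passes without zeroing: the second gradient is added to the first.
(100 * x).sum().backward()
(100 * x).sum().backward()
accumulated = x.grad.clone()

# Zeroing first gives the gradient of the current pass only.
optim.zero_grad()
(100 * x).sum().backward()
fresh = x.grad.clone()

print(accumulated, fresh)
```

In a typical loop the order per batch is therefore zero_grad(), forward pass, backward(), step().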

I like this kind of "hands on" explanation to understand things. Thanks, makes much more sense to me! — kushy, Sep 25, 2020 at 16:23
It should be "Compute gradients of the loss w.r.t. the parameters." — passerby51, Sep 9, 2022 at 0:25

LollipopKnight · Accepted Answer · 2021-02-14 03:55:04Z

Some answers explain this well, but I'd like to give a specific example to explain the mechanism.

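Suppose we have a function: z = 3 x^2 + y^3.
The gradient of z w.r.t. x and y, and the resulting update with learning rate lr, are:

dz/dx = 6 x,  dz/dy = 3 y^2
x := x - lr * dz/dx,  y := y - lr * dz/dy

The initial values are x=1 and y=2.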

import torch
from torch import optim

x = torch.tensor([1.0], requires_grad=True)
y = torch.tensor([2.0], requires_grad=True)
z = 3*x**2+y**3

print("x.grad: ", x.grad)
print("y.grad: ", y.grad)
print("z.grad: ", z.grad)

# print result should be:
# x.grad:  None
# y.grad:  None
# z.grad:  None

Then calculate the gradient of z w.r.t. x and y at the current values (x=1, y=2):

# calculate the gradient
z.backward()

print("x.grad: ", x.grad)
print("y.grad: ", y.grad)
print("z.grad: ", z.grad)

# print result should be:
# x.grad:  tensor([6.])
# y.grad:  tensor([12.])
# z.grad:  None  (z is not a leaf tensor, so its gradient is not retained)

Finally, use the SGD optimizer to update the values of x and y according to the formula:

# create an optimizer, pass x and y as the parameters to be updated, and set the learning rate lr=0.1
optimizer = optim.SGD([x, y], lr=0.1)

# executing an update step
optimizer.step()

# print the updated values of x and y
print("x:", x)
print("y:", y)

# print result should be:
# x: tensor([0.4000], requires_grad=True)
# y: tensor([0.8000], requires_grad=True)

Very good explanations, the analogy with the real example is great honestly — Timbus Calin, Jun 25, 2022 at 11:20

Akavall · Accepted Answer · 2020-07-29 03:48:28Z

31

Let's say we defined a model: model, and loss function: criterion and we have the following sequence of steps:

pred = model(input)
loss = criterion(pred, true_labels)
loss.backward()

pred will have a grad_fn attribute that references the function that created it and ties it back to the model. Therefore, loss.backward() will have information about the model it is working with.
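One way to see this link is to walk the graph from loss.grad_fn back to the leaf tensors. The sketch below assumes a toy nn.Linear model and random data; reachable_leaves is a helper written for this illustration, and it relies on the AccumulateGrad nodes exposing their leaf tensor as .variable.

```python
import torch
from torch import nn

model = nn.Linear(2, 1)  # illustrative model
criterion = nn.MSELoss()

pred = model(torch.randn(5, 2))
loss = criterion(pred, torch.randn(5, 1))


def reachable_leaves(root_fn):
    """Collect the leaf tensors reachable from a grad_fn node."""
    leaves, stack, seen = [], [root_fn], set()
    while stack:
        fn = stack.pop()
        if fn is None or fn in seen:
            continue
        seen.add(fn)
        # AccumulateGrad nodes hold the leaf tensor in `.variable`.
        if hasattr(fn, "variable"):
            leaves.append(fn.variable)
        stack.extend(next_fn for next_fn, _ in fn.next_functions)
    return leaves


leaves = reachable_leaves(loss.grad_fn)
found_all = all(any(p is leaf for leaf in leaves) for p in model.parameters())
print(type(loss.grad_fn).__name__, found_all)
```

Every parameter of the model is reachable from the loss, which is how backward() knows where to write the gradients.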

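Try removing the grad_fn attribute, for example with:

pred = pred.clone().detach()

Then the loss is no longer connected to the model: loss.backward() cannot reach the parameters (it raises a RuntimeError if nothing else in the loss requires grad), the model gradients stay None, and consequently the weights will not get updated.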

And the optimizer is tied to the model because we pass model.parameters() when we create the optimizer.
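The effect of detaching can be checked with a short sketch. The nn.Linear model, the MSE criterion and the random data are assumptions made for the illustration.

```python
import torch
from torch import nn

model = nn.Linear(2, 1)  # illustrative model and data
criterion = nn.MSELoss()
inputs, labels = torch.randn(4, 2), torch.randn(4, 1)

pred = model(inputs).clone().detach()  # cut the link to the model
loss = criterion(pred, labels)

has_graph = loss.grad_fn is not None  # no path back to the model

try:
    loss.backward()
    backward_failed = False
except RuntimeError:
    backward_failed = True  # nothing in the loss requires grad

grads = [p.grad for p in model.parameters()]  # never filled in
print(has_graph, backward_failed, grads)
```

With the graph cut, an optimizer built from model.parameters() would find no gradients to apply.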

edited Jul 29, 2020 at 3:48

answered May 25, 2020 at 23:49


1

shouldn't it be "Then the models gradients will not get updated.", since loss.backward() updates the gradients?
– zwithouta
Jul 28, 2020 at 13:46
@zwithouta, Thanks, this is a good point. I updated my answer.
– Akavall
Jul 29, 2020 at 3:49
Thanks, the fact that grad_fn in the tensor is a reference to the model was the missing piece of the puzzle for me.
– gog
Mar 1 at 18:04


villa121 · Accepted Answer · 2023-09-12 05:46:32Z

Maybe some of you are still a little confused because of the way the loss is being called.

These are the three main steps to run backprop and then update the parameters.

loss = loss_function(y_hat, Y_train)
loss.backward()
optimizer.step()

I was lost about the connection, but the key is the loss definition call. Every loss function returns a Tensor as well. Through its grad_fn, this tensor is linked to ALL the parameters used to calculate y_hat. Y_train takes part in the computation too, but this won't be a problem: it normally does not require gradients and it is not among the parameters given to the optimizer.

When you declare an optimizer, you pass as an argument the params that it should update.

optimizer = torch.optim.SGD(params=model.parameters(), lr=0.001)

This means that when loss.backward() is called, the gradients (or derivatives) of the loss are calculated for every parameter that produced y_hat and stored in each parameter's .grad. When the optimizer is then called, it reads those stored gradients for the parameters it was given (and not for Y_train, because it was never passed to the optimizer).

So the connection between the loss and the optim is the set of model parameters: y_hat, the prediction output, links the loss back to them, and the optimizer holds references to the same tensors.

To summarize

# Some model created
model = MyModel()

# Optimizer to use; it keeps references to the model's parameters
optimizer = torch.optim.SGD(params=model.parameters(), lr=0.001)

# Loss function to apply
loss_function = torch.nn.L1Loss()

# Forward pass (X_train: the training inputs, assumed to exist alongside Y_train)
y_hat = model(X_train)

# The output of this call is a scalar Tensor linked, through its graph, to all the parameters of the model
loss = loss_function(y_hat, Y_train)

# This calculates the gradient of the loss for every parameter that produced y_hat and stores it in .grad
loss.backward()

# Each parameter now holds its gradient in .grad; this only applies the gradient descent update to the parameters given to the optimizer
optimizer.step()
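For a version that runs end to end, here is a minimal sketch of the same steps inside a training loop. The nn.Linear model, the synthetic data for X_train and Y_train, the learning rate and the number of epochs are all assumptions chosen for illustration.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Illustrative, noise-free data on the line y = 2x + 1.
X_train = torch.linspace(-1, 1, 20).unsqueeze(1)
Y_train = 2 * X_train + 1

model = nn.Linear(1, 1)
optimizer = torch.optim.SGD(params=model.parameters(), lr=0.05)
loss_function = nn.L1Loss()

losses = []
for epoch in range(200):
    optimizer.zero_grad()                 # drop gradients from the previous step
    y_hat = model(X_train)                # forward pass builds the graph
    loss = loss_function(y_hat, Y_train)  # scalar tensor linked to the parameters
    loss.backward()                       # fills .grad on model.weight and model.bias
    optimizer.step()                      # updates those same tensors in place
    losses.append(loss.item())

print(losses[0], losses[-1])
```

The loss and the optimizer never reference each other; the loop works because both touch the same parameter tensors.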

sɐunıɔןɐqɐp · Accepted Answer · 2020-08-02 09:31:39Z

-2

Short answer:

loss.backward() # computes the gradient of the loss for all parameters for which we set requires_grad=True. Parameters could be any tensors defined in the code, like h2h or i2h.

optimizer.step() # according to the optimizer's update rule (the optimizer is defined previously in our code), we update those parameters, moving toward a lower loss (error).

edited Aug 2, 2020 at 9:31


answered Aug 2, 2020 at 7:56

pourya
