188

Where is an explicit connection between the optimizer and the loss?

How does the optimizer know where to get the gradients of the loss without a call like this: optimizer.step(loss)?

-More context-

When I minimize the loss, I didn't have to pass the gradients to the optimizer.

loss.backward() # Back Propagation
optimizer.step() # Gradient Descent

7 Answers

141

Without delving too deep into the internals of pytorch, I can offer a simplistic answer:

Recall that when initializing the optimizer you explicitly tell it which parameters (tensors) of the model it should update. The gradients are "stored" by the tensors themselves (they have grad and requires_grad attributes) once you call backward() on the loss. After computing the gradients for all tensors in the model, calling optimizer.step() makes the optimizer iterate over all the parameters (tensors) it is supposed to update and use their internally stored grad to update their values.
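To make the "the optimizer holds the parameters" point concrete, here is a minimal sketch. The tiny nn.Linear model and the all-ones input are assumptions made up for illustration; param_groups is the optimizer's standard attribute holding the tensors it was given.

```python
import torch
from torch import nn

# Hypothetical tiny model, used only for illustration
model = nn.Linear(2, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# The optimizer keeps references to the very same tensor objects as the model
opt_params = optimizer.param_groups[0]["params"]
same_objects = all(p is q for p, q in zip(opt_params, model.parameters()))
print(same_objects)  # True

# Gradients written by backward() are therefore visible through the optimizer
loss = model(torch.ones(1, 2)).sum()
loss.backward()
print(opt_params[0].grad)  # tensor([[1., 1.]])
```

Because both sides share the same tensors, nothing has to be passed from loss.backward() to optimizer.step().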

More info on computational graphs and the additional "grad" information stored in pytorch tensors can be found in this answer.

Because the optimizer holds references to the parameters, this can sometimes cause trouble, e.g., when the model is moved to GPU after initializing the optimizer. Make sure you are done setting up your model before constructing the optimizer. See this answer for more details.

10
  • 26
    @Aerin it's not a trivial connection... One would have expected optimizer.step to get loss.backward() as an argument. However, it all happens "behind the curtain"...
    – Shai
    Dec 30, 2018 at 6:52
  • 20
    So how does optimizer.step() get the gradient value from loss.backward()? It seems that this answer hasn't explained the mechanism for the "connection". The optimizer has a reference to the model parameters, but the loss function is completely on its own. It doesn't look like it has a reference to the model or the optimizer.
    – mofury
    Mar 29, 2020 at 21:53
  • 5
    @cfeng the loss function is not on its own at all! It is the final node (the root) of a single gigantic computational graph which starts with the model inputs and contains all model parameters. This graph is built for each batch and results in a single scalar number per batch. When we call loss.backward(), backpropagation starts at the loss and goes through all of its parents all the way to the model inputs. All nodes in the graph contain a reference to their parents. Aug 29, 2020 at 20:12
  • 4
    @mofury The question isn't simple to answer briefly. Roughly speaking, an instance of a loss function class, say nn.CrossEntropyLoss, can be called and returns a Tensor. Importantly, this Tensor has a grad_fn attribute that records the operation that produced it and, through it, the tensors it was derived from. Those tensors have the same attribute, so backward() can follow the chain and eventually arrive at the model parameters we want to optimize. You can refer to this: pytorch.org/tutorials/beginner/basics/autogradqs_tutorial.html.
    – C.K.
    Aug 2, 2021 at 13:30
  • 2
    @MonaJalal where did you see an optimizer in a validation code? Optimizer has nothing to do with validation/test
    – Shai
    Mar 9, 2022 at 19:13
58

When you call loss.backward(), all it does is compute the gradient of the loss w.r.t. all the parameters involved in computing the loss that have requires_grad = True, and store each gradient in that parameter's parameter.grad attribute.

optimizer.step() then updates all the parameters based on parameter.grad.
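One way to see that optimizer.step() only reads .grad is to skip the loss entirely and write a gradient by hand. This is just an illustrative sketch; the tensor w and the gradient values are made up.

```python
import torch

# A standalone parameter; no model and no loss involved
w = torch.tensor([1.0, 2.0], requires_grad=True)
optimizer = torch.optim.SGD([w], lr=0.5)

# Write a gradient by hand instead of calling loss.backward()
w.grad = torch.tensor([2.0, 4.0])

optimizer.step()  # plain SGD: w <- w - lr * w.grad
print(w)  # tensor([0., 0.], requires_grad=True)
```

The optimizer never learns where the numbers in w.grad came from, which is why no loss argument is needed.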

2
  • 6
    loss is computing the loss between two tensors with no relation to a network. How does loss.backward() know which network it needs to reference and compute parameter.grad for? May 21, 2020 at 7:53
  • @AzizAlfoudari see my answer for a clarification attempt :). Aug 29, 2020 at 20:14
57

Perhaps this will clarify a little the connection between loss.backward and optim.step (although the other answers are to the point).

import torch

# Our "model"
x = torch.tensor([1., 2.], requires_grad=True)
y = 100*x

# Compute loss
loss = y.sum()

# Compute gradients of the loss w.r.t. the parameters
print(x.grad)     # None
loss.backward()
print(x.grad)     # tensor([100., 100.])

# Modify the parameters by subtracting the gradient (scaled by the learning rate)
optim = torch.optim.SGD([x], lr=0.001)
print(x)        # tensor([1., 2.], requires_grad=True)
optim.step()
print(x)        # tensor([0.9000, 1.9000], requires_grad=True)

loss.backward() sets the grad attribute of all leaf tensors with requires_grad=True in the computational graph of which loss is the root (only x in this case).

The optimizer just iterates through the list of parameters (tensors) it received on initialization and, for every tensor that has a gradient stored in its .grad attribute, subtracts that gradient (simply multiplied by the learning rate in the case of plain SGD). It doesn't need to know with respect to which loss the gradients were computed; it just accesses the .grad attribute so it can do x = x - lr * x.grad.
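As a rough check of the x = x - lr * x.grad description, the sketch below updates one tensor with torch.optim.SGD and an identical copy by hand. It assumes plain SGD (no momentum, no weight decay); the names a and b are made up for the comparison.

```python
import torch

lr = 0.001
a = torch.tensor([1.0, 2.0], requires_grad=True)  # updated by the optimizer
b = torch.tensor([1.0, 2.0], requires_grad=True)  # updated by hand

(100 * a).sum().backward()
(100 * b).sum().backward()

torch.optim.SGD([a], lr=lr).step()

with torch.no_grad():      # keep the update itself out of the autograd graph
    b -= lr * b.grad       # in-place, so b stays the same tensor object

print(torch.allclose(a, b))  # True
```

Other optimizers (Adam, SGD with momentum) use the same .grad values but apply a different update rule.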

Note that if we were doing this in a train loop we would call optim.zero_grad() because in each train step we want to compute new gradients - we don't care about gradients from the previous batch. Not zeroing grads would lead to gradient accumulation across batches.
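The accumulation behaviour is easy to observe directly. A small sketch, reusing the toy "model" from above (the variable names first, accumulated and fresh are only for illustration):

```python
import torch

x = torch.tensor([1.0, 2.0], requires_grad=True)
optim = torch.optim.SGD([x], lr=0.001)

(100 * x).sum().backward()
first = x.grad.clone()           # tensor([100., 100.])

(100 * x).sum().backward()       # no zero_grad() in between
accumulated = x.grad.clone()     # tensor([200., 200.]): the gradients were summed

optim.zero_grad()                # reset before the next batch
(100 * x).sum().backward()
fresh = x.grad.clone()           # tensor([100., 100.])

print(first, accumulated, fresh)
```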

2
  • 7
    I like this kind of "hands on" explanation to understand things. Thanks, makes much more sense to me!
    – kushy
    Sep 25, 2020 at 16:23
  • 1
    It should be "Compute gradients of the loss w.r.t. the parameters."
    – passerby51
    Sep 9, 2022 at 0:25
35

Some answers explained well, but I'd like to give a specific example to explain the mechanism.

Suppose we have a function: z = 3x^2 + y^3.
The gradient of z w.r.t. x and y is:

∂z/∂x = 6x, ∂z/∂y = 3y^2

The initial values are x=1 and y=2.

import torch

x = torch.tensor([1.0], requires_grad=True)
y = torch.tensor([2.0], requires_grad=True)
z = 3*x**2+y**3

print("x.grad: ", x.grad)
print("y.grad: ", y.grad)
print("z.grad: ", z.grad)

# print result should be:
# x.grad:  None
# y.grad:  None
# z.grad:  None

Then calculate the gradient w.r.t. x and y at the current values (x=1, y=2):

∂z/∂x = 6 * 1 = 6, ∂z/∂y = 3 * 2^2 = 12

# calculate the gradient
z.backward()

print("x.grad: ", x.grad)
print("y.grad: ", y.grad)
print("z.grad: ", z.grad)

# print result should be:
# x.grad:  tensor([6.])
# y.grad:  tensor([12.])
# z.grad:  None

Finally, use the SGD optimizer to update the values of x and y according to the formula: x = x - lr * ∂z/∂x, y = y - lr * ∂z/∂y

# create an optimizer, pass x, y as the parameters to be updated, setting the learning rate lr=0.1
optimizer = torch.optim.SGD([x, y], lr=0.1)

# execute an update step
optimizer.step()

# print the updated values of x and y
print("x:", x)
print("y:", y)

# print result should be:
# x: tensor([0.4000], requires_grad=True)
# y: tensor([0.8000], requires_grad=True)
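To continue the example for a second step, the same three calls can be put in a loop. This is a sketch rather than part of the original answer; the only addition is optimizer.zero_grad(), so that the gradients from the first step are not added to those of the second. By hand, the second step should give roughly x = 0.4 - 0.1 * 6 * 0.4 = 0.16 and y = 0.8 - 0.1 * 3 * 0.8^2 = 0.608.

```python
import torch

x = torch.tensor([1.0], requires_grad=True)
y = torch.tensor([2.0], requires_grad=True)
optimizer = torch.optim.SGD([x, y], lr=0.1)

for step in range(2):
    optimizer.zero_grad()        # drop gradients from the previous step
    z = 3 * x**2 + y**3          # rebuild the graph with the current values
    z.backward()                 # fills x.grad and y.grad
    optimizer.step()             # x <- x - 0.1 * x.grad, y <- y - 0.1 * y.grad
    print(step, x.item(), y.item())
```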
2
  • 3
    This is a great explanation along with the code. Thanks. Jul 22, 2021 at 2:33
  • Very good explanations, the analogy with the real example is great honestly Jun 25, 2022 at 11:20
31

Let's say we defined a model: model, and loss function: criterion and we have the following sequence of steps:

pred = model(input)
loss = criterion(pred, true_labels)
loss.backward()

pred will have a grad_fn attribute that references the function that created it and ties it back to the model. Therefore, loss.backward() will have information about the model it is working with.
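A quick way to inspect this link is to print grad_fn for each tensor. The sketch below uses a made-up nn.Linear model, nn.MSELoss as the criterion, and random data, purely as stand-ins for model, criterion, input and true_labels.

```python
import torch
from torch import nn

model = nn.Linear(3, 1)                 # hypothetical stand-in model
criterion = nn.MSELoss()

inputs = torch.randn(4, 3)              # made-up batch
true_labels = torch.randn(4, 1)

pred = model(inputs)
loss = criterion(pred, true_labels)

print(pred.grad_fn is not None)   # True: pred remembers how it was produced
print(loss.grad_fn is not None)   # True: loss is linked to pred, hence to the model
print(inputs.grad_fn)             # None: plain data, not produced by tracked ops

loss.backward()
print(all(p.grad is not None for p in model.parameters()))  # True
```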

Try removing the grad_fn attribute, for example with:

pred = pred.clone().detach()

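Then loss no longer has a grad_fn, so loss.backward() has nothing to backpropagate through (PyTorch raises a RuntimeError unless some other input to the loss requires gradients). The model gradients stay None and consequently the weights will not get updated.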
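The effect of detaching can be checked with a small experiment. As before, the nn.Linear model, nn.MSELoss and the random data are assumptions for illustration; the try/except is there because, in the PyTorch versions I am aware of, calling backward() on a tensor without a grad_fn raises a RuntimeError.

```python
import torch
from torch import nn

model = nn.Linear(3, 1)                 # hypothetical stand-in model
criterion = nn.MSELoss()
inputs, true_labels = torch.randn(4, 3), torch.randn(4, 1)

pred = model(inputs).clone().detach()   # cut the link back to the model
loss = criterion(pred, true_labels)
print(loss.grad_fn)                     # None

try:
    loss.backward()
    raised = False
except RuntimeError:
    raised = True                       # nothing in the graph requires grad

print(raised)
print([p.grad for p in model.parameters()])  # [None, None]
```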

And the optimizer is tied to the model because we pass model.parameters() when we create the optimizer.

3
  • 1
    shouldn't it be "Then the models gradients will not get updated.", since loss.backward() updates the gradients?
    – zwithouta
    Jul 28, 2020 at 13:46
  • @zwithouta, Thanks, this is a good point. I updated my answer.
    – Akavall
    Jul 29, 2020 at 3:49
    Thanks, the fact that grad_fn in the tensor is a reference to the model was the missing piece of the puzzle for me.
    – gog
    Mar 1 at 18:04
0

Maybe some of you are still a little confused because of the way the loss is being called.

These are the three main steps to run backprop and then update the parameters.

loss = loss_function(y_hat, Y_train)
loss.backward()
optimizer.step()

I was lost about the connection, but the key is the loss definition call. Every loss function returns a Tensor as well. Through its grad_fn, this tensor is linked back to ALL the parameters used to calculate y_hat, and also to Y_train (this won't be a problem, because the optimizer only updates the parameters it was given).

When you declare an optimizer, you pass as an argument the params that need to be updated.

optimizer = torch.optim.SGD(params=model.parameters(), lr=0.001)

This means that when loss.backward() is called, it calculates the gradients (or derivatives) of the loss w.r.t. the model parameters that produced y_hat and stores them in those parameters. When optimizer.step() is then called, it reads only the gradients of the parameters it was given (and not anything for Y_train, because it was never passed to the optimizer).

So the connection between the loss and the optim is y_hat, the prediction output, which links the loss back to the model parameters that both of them share.

To summarize

# Some model created
model = MyModel()

# Optimizer to use (it must receive the parameters of the same model)
optimizer = torch.optim.SGD(params=model.parameters(), lr=0.001)

# Loss function to apply
loss_function = torch.nn.L1Loss()

# The output of this call is a scalar Tensor whose grad_fn links back to the model parameters through y_hat
loss = loss_function(y_hat, Y_train)

# This computes the gradient of the loss w.r.t. every parameter that produced y_hat and stores it in .grad
loss.backward()

# Each parameter now holds its gradient in .grad; this only applies the gradient descent update to the parameters given to the optimizer
optimizer.step()
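For completeness, here is one way the summary could look as a runnable loop. Everything not in the summary above is an assumption for illustration: nn.Linear stands in for MyModel, X_train and Y_train are made-up data following Y = 2 * X, and the learning rate and epoch count are arbitrary.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Made-up data: Y = 2 * X
X_train = torch.linspace(-1, 1, 20).unsqueeze(1)
Y_train = 2 * X_train

model = nn.Linear(1, 1)                                   # stand-in for MyModel
optimizer = torch.optim.SGD(params=model.parameters(), lr=0.05)
loss_function = nn.L1Loss()

losses = []
for epoch in range(500):
    optimizer.zero_grad()                  # clear gradients from the last step
    y_hat = model(X_train)                 # forward pass builds the graph
    loss = loss_function(y_hat, Y_train)   # scalar tensor linked to the model
    loss.backward()                        # stores gradients in each parameter's .grad
    optimizer.step()                       # reads .grad and updates the parameters
    losses.append(loss.item())

print(losses[0], losses[-1])               # the loss should have decreased
```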
-2

Short answer:

loss.backward() # compute the gradient for all parameters for which we set requires_grad=True. Parameters could be any variable defined in the code, like h2h or i2h.

optimizer.step() # according to the optimizer (defined previously in our code), update those parameters so that we eventually reach the minimum loss (error).
