
[signal SIGSEGV: segmentation violation code=0x1 addr=0x78 pc=0x5ff488] #2969

Closed · mman opened this issue Jan 31, 2019 · 4 comments · Fixed by #2970

Comments

mman commented Jan 31, 2019

Description

Running Docker version 18.09.1, build 4c52b90, on Ubuntu 18.04.1 LTS (GNU/Linux 4.15.0-44-generic x86_64). My containers, all running Swift code based on the official Swift images (latest tag) from https://hub.docker.com/_/swift, crash randomly every couple of days.

The only message I receive is "shim reaped", and because the container restart policy is set to unless-stopped, the container restarts automatically and runs again for another couple of days.

According to the logs and docker stats, the container's memory usage is stable and the process inside the container reports no trouble.

I enabled containerd debug logging and found the stack trace pasted below. The same stack trace is reported from multiple containers running different Swift projects.

Describe the results you received:

The container randomly restarts without any apparent cause; only "shim reaped" is reported in the logs.

Describe the results you expected:

Container running smoothly for years :)

I'm not sure whether this "shim reaped" is the result of a silent crash of the process inside the container; if so, I'd love more diagnostics about what happened and to which process. Right now I can't tell whether my app crashed or containerd-shim crashed. Please help me clarify. The stack trace below points to a nil pointer dereference in containerd Go code.

Output of containerd --version:

containerd github.com/containerd/containerd 1.2.2 9754871865f7fe2f4e74d43e2fc7ccd237edcbce

```
Jan 31 08:24:11 pmx-2 containerd[784]: panic: runtime error: invalid memory address or nil pointer dereference
Jan 31 08:24:11 pmx-2 containerd[784]: [signal SIGSEGV: segmentation violation code=0x1 addr=0x78 pc=0x5ff488]
Jan 31 08:24:11 pmx-2 containerd[784]: goroutine 18 [running]:
Jan 31 08:24:11 pmx-2 containerd[784]: github.com/containerd/containerd/runtime/v1/linux/proc.(*execProcess).pidv(...)
Jan 31 08:24:11 pmx-2 containerd[784]: #011/go/src/github.com/containerd/containerd/runtime/v1/linux/proc/exec.go:76
Jan 31 08:24:11 pmx-2 containerd[784]: github.com/containerd/containerd/runtime/v1/linux/proc.(*execStoppedState).Pid(0x8d8cf0, 0xffffffffffffffff)
Jan 31 08:24:11 pmx-2 containerd[784]: #011/go/src/github.com/containerd/containerd/runtime/v1/linux/proc/exec_state.go:175 +0x8
Jan 31 08:24:11 pmx-2 containerd[784]: github.com/containerd/containerd/runtime/v1/linux/proc.(*execProcess).Pid(0xc420088000, 0x6b2)
Jan 31 08:24:11 pmx-2 containerd[784]: #011/go/src/github.com/containerd/containerd/runtime/v1/linux/proc/exec.go:72 +0x34
Jan 31 08:24:11 pmx-2 containerd[784]: github.com/containerd/containerd/runtime/v1/shim.(*Service).checkProcesses(0xc420136000, 0xbf0cc686daab6e4b, 0x42f0c5aab511, 0x8bc720, 0x6ab7, 0x0)
Jan 31 08:24:11 pmx-2 containerd[784]: #011/go/src/github.com/containerd/containerd/runtime/v1/shim/service.go:514 +0xde
Jan 31 08:24:11 pmx-2 containerd[784]: github.com/containerd/containerd/runtime/v1/shim.(*Service).processExits(0xc420136000)
Jan 31 08:24:11 pmx-2 containerd[784]: #011/go/src/github.com/containerd/containerd/runtime/v1/shim/service.go:492 +0xd0
Jan 31 08:24:11 pmx-2 containerd[784]: created by github.com/containerd/containerd/runtime/v1/shim.NewService
Jan 31 08:24:11 pmx-2 containerd[784]: #011/go/src/github.com/containerd/containerd/runtime/v1/shim/service.go:91 +0x3e9
Jan 31 08:24:11 pmx-2 containerd[784]: time="2019-01-31T08:24:11.458001949+01:00" level=info msg="shim reaped" id=0f04f4b172dbd5f40b27e98ba6559ad072852cca84c1410792883a35c0cc7076
Jan 31 08:24:11 pmx-2 containerd[784]: time="2019-01-31T08:24:11.458352273+01:00" level=warning msg="cleaning up after killed shim" id=0f04f4b172dbd5f40b27e98ba6559ad072852cca84c1410792883a35c0cc7076 namespace=moby
```
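
As a side note on reading the trace: in Go, a panic of this shape with a small fault address usually means a struct field was read through a nil pointer, and `addr` is that field's offset from address zero; the runtime converts the resulting SIGSEGV into the panic shown. Here is a contrived, self-contained demo of the mechanism — this is not containerd's real struct layout, and the 0x78 offset is chosen only to mirror the log above:

```go
package main

// proc is a hypothetical stand-in; the padding places the pid field
// at offset 0x78 purely so the fault address matches the log.
type proc struct {
	_   [0x78]byte
	pid int
}

// Pid reads a field through the receiver; if the receiver is nil,
// the load faults at address 0x78 (the field's offset from nil).
func (p *proc) Pid() int {
	return p.pid
}

func main() {
	var p *proc // nil, e.g. a state object holding no process
	_ = p.Pid()
	// panic: runtime error: invalid memory address or nil pointer dereference
	// [signal SIGSEGV: segmentation violation code=0x1 addr=0x78 pc=...]
}
```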
Random-Liu (Member) commented Feb 1, 2019

This looks like a race condition introduced in #2826.

I don't understand why we can simply remove the lock for the stopped state.

Based on the PR description, I think we can use a finer-grained lock for pid instead of removing the lock outright. Removing the lock introduces a race condition: the execState itself can be updated by a state transition, but that access is not protected in Pid().

I'm marking this p0, because it means that exec'ing into a container multiple times may panic the containerd-shim... which sounds really, really bad to me. Consider that users commonly use exec for liveness probes; a liveness probe could then kill the containerd-shim and eventually kill the container itself, if I remember the cleanupAfterDeadShim logic correctly...
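
To make that concrete, here is a minimal, runnable sketch of the pattern being described — the type and method names are simplified stand-ins, not containerd's actual code. Pid() reads the shared state field with no synchronization while a concurrent transition rewrites it; running this with `go run -race` reports the data race:

```go
package main

import (
	"fmt"
	"sync"
)

// state stands in for containerd's execState interface.
type state interface {
	Pid() int
}

type runningState struct{ pid int }

func (s *runningState) Pid() int { return s.pid }

type stoppedState struct{ pid int }

func (s *stoppedState) Pid() int { return s.pid }

// execProcess mirrors the shape of the real struct: a mutex plus a
// state field that transitions overwrite.
type execProcess struct {
	mu    sync.Mutex
	state state
}

// Pid reads the state field WITHOUT taking mu — the unprotected
// read described above.
func (p *execProcess) Pid() int {
	return p.state.Pid()
}

// transition swaps the state under the lock, e.g. running -> stopped.
func (p *execProcess) transition(s state) {
	p.mu.Lock()
	p.state = s
	p.mu.Unlock()
}

func main() {
	p := &execProcess{state: &runningState{pid: 1234}}

	var wg sync.WaitGroup
	wg.Add(2)
	go func() { // exit handler moving the exec process to stopped
		defer wg.Done()
		p.transition(&stoppedState{pid: 1234})
	}()
	go func() { // concurrent Pid() call, as in checkProcesses
		defer wg.Done()
		fmt.Println("pid:", p.Pid())
	}()
	wg.Wait()
}
```

The finer-grained alternative suggested above would guard just the read of the shared state rather than serializing every operation behind one big lock; a sketch of that appears further down in this thread.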

mman (Author) commented Feb 1, 2019

You are correct: my crashing containers all use health check probes, and the health check log message is the last one I see before the shim is reaped, so your multi-exec race theory seems valid.

mman (Author) commented Feb 11, 2019

Thanks for your great work, guys. Is there anything I can do to help speed up the official push of release 1.2.3 to Docker?

tom0392 commented Nov 21, 2023

> This looks like a race condition introduced in #2826.
>
> I don't understand why we can simply remove the lock for the stopped state.
>
> Based on the PR description, I think we can use a finer-grained lock for pid instead of removing the lock outright. Removing the lock introduces a race condition: the execState itself can be updated by a state transition, but that access is not protected in Pid().
>
> I'm marking this p0, because it means that exec'ing into a container multiple times may panic the containerd-shim... which sounds really, really bad to me. Consider that users commonly use exec for liveness probes; a liveness probe could then kill the containerd-shim and eventually kill the container itself, if I remember the cleanupAfterDeadShim logic correctly...

Here are our two questions about this bug:

1. Since it happens only occasionally, is there a way to reproduce the problem quickly?
2. What do you mean by "the execState itself can be updated by a state transition, but that access is not protected in Pid()"? I have thought about it several times but still don't fully understand it.

@Random-Liu Looking forward to your reply. Thanks.
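
To unpack that sentence: the exec process is a small state machine inside the shim. When the exec'd process exits, its state object is swapped (for example from a running state to a stopped state) while holding the process lock; after #2826, Pid() read that same state pointer without taking the lock, so a Pid() call racing with an exit could observe the state mid-update and, per the stack trace at the top of this issue, dereference a nil pointer. Reusing the hypothetical execProcess type from the race sketch above, a protected read would look like this — a sketch of the idea, not the actual fix in #2970:

```go
// Pid copies the shared state pointer while holding the lock, then
// calls Pid() on the local copy. Only the pointer read is serialized,
// which is the "finer grained" locking idea from the earlier comment.
// This replaces the unprotected Pid() in the sketch above.
func (p *execProcess) Pid() int {
	p.mu.Lock()
	s := p.state
	p.mu.Unlock()
	return s.Pid()
}
```

As for question 1, races like this are usually surfaced by stress rather than by a deterministic recipe: run many concurrent short-lived execs against one container (which is exactly what a frequent health check does), ideally with the shim built with the race detector enabled.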
