
[signal SIGSEGV: segmentation violation code=0x1 addr=0x78 pc=0x5ff488] #2969

Closed · mman opened this issue Jan 31, 2019 · 4 comments · Fixed by #2970

Comments

mman commented Jan 31, 2019

Description

Running Docker version 18.09.1, build 4c52b90, on Ubuntu 18.04.1 LTS (GNU/Linux 4.15.0-44-generic x86_64). My containers, all running Swift code based on the official Swift images (latest tag) from https://hub.docker.com/_/swift, crash randomly every couple of days.

The only message I receive is "shim reaped", and because the container restart policy is set to unless-stopped, the container restarts automatically and runs again for another couple of days.

According to the logs and docker stats, the container's memory usage is stable and the process inside the container reports no trouble.

I enabled containerd debug logging and found the stack trace pasted below. The same stack trace is reported from multiple containers running different Swift projects.

Describe the results you received:

The container randomly restarts without any apparent cause; only "shim reaped" is reported in the logs.

Describe the results you expected:

Container running smoothly for years :)

I'm not sure whether this "shim reaped" is the result of a silent crash of the process inside the container; if so, I'd love more diagnostics about what happened and to which process. Right now I can't tell whether my app crashed or containerd-shim crashed. Please help me clarify. The stack trace below points to a nil pointer dereference in containerd Go code.

Output of containerd --version:

containerd github.com/containerd/containerd 1.2.2 9754871865f7fe2f4e74d43e2fc7ccd237edcbce

```
Jan 31 08:24:11 pmx-2 containerd[784]: panic: runtime error: invalid memory address or nil pointer dereference
Jan 31 08:24:11 pmx-2 containerd[784]: [signal SIGSEGV: segmentation violation code=0x1 addr=0x78 pc=0x5ff488]
Jan 31 08:24:11 pmx-2 containerd[784]: goroutine 18 [running]:
Jan 31 08:24:11 pmx-2 containerd[784]: github.com/containerd/containerd/runtime/v1/linux/proc.(*execProcess).pidv(...)
Jan 31 08:24:11 pmx-2 containerd[784]: #011/go/src/github.com/containerd/containerd/runtime/v1/linux/proc/exec.go:76
Jan 31 08:24:11 pmx-2 containerd[784]: github.com/containerd/containerd/runtime/v1/linux/proc.(*execStoppedState).Pid(0x8d8cf0, 0xffffffffffffffff)
Jan 31 08:24:11 pmx-2 containerd[784]: #011/go/src/github.com/containerd/containerd/runtime/v1/linux/proc/exec_state.go:175 +0x8
Jan 31 08:24:11 pmx-2 containerd[784]: github.com/containerd/containerd/runtime/v1/linux/proc.(*execProcess).Pid(0xc420088000, 0x6b2)
Jan 31 08:24:11 pmx-2 containerd[784]: #011/go/src/github.com/containerd/containerd/runtime/v1/linux/proc/exec.go:72 +0x34
Jan 31 08:24:11 pmx-2 containerd[784]: github.com/containerd/containerd/runtime/v1/shim.(*Service).checkProcesses(0xc420136000, 0xbf0cc686daab6e4b, 0x42f0c5aab511, 0x8bc720, 0x6ab7, 0x0)
Jan 31 08:24:11 pmx-2 containerd[784]: #011/go/src/github.com/containerd/containerd/runtime/v1/shim/service.go:514 +0xde
Jan 31 08:24:11 pmx-2 containerd[784]: github.com/containerd/containerd/runtime/v1/shim.(*Service).processExits(0xc420136000)
Jan 31 08:24:11 pmx-2 containerd[784]: #011/go/src/github.com/containerd/containerd/runtime/v1/shim/service.go:492 +0xd0
Jan 31 08:24:11 pmx-2 containerd[784]: created by github.com/containerd/containerd/runtime/v1/shim.NewService
Jan 31 08:24:11 pmx-2 containerd[784]: #011/go/src/github.com/containerd/containerd/runtime/v1/shim/service.go:91 +0x3e9
Jan 31 08:24:11 pmx-2 containerd[784]: time="2019-01-31T08:24:11.458001949+01:00" level=info msg="shim reaped" id=0f04f4b172dbd5f40b27e98ba6559ad072852cca84c1410792883a35c0cc7076
Jan 31 08:24:11 pmx-2 containerd[784]: time="2019-01-31T08:24:11.458352273+01:00" level=warning msg="cleaning up after killed shim" id=0f04f4b172dbd5f40b27e98ba6559ad072852cca84c1410792883a35c0cc7076 namespace=moby
```
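
As a side note on reading the trace: in Go, a panic of this shape with a small fault address usually means a struct field was read through a nil pointer, and `addr` is that field's offset from address zero; the runtime converts the resulting SIGSEGV into the panic shown. Here is a contrived, self-contained demo of the mechanism — this is not containerd's real struct layout, and the 0x78 offset is chosen only to mirror the log above:

```go
package main

// proc is a hypothetical stand-in; the padding places the pid field
// at offset 0x78 purely so the fault address matches the log.
type proc struct {
	_   [0x78]byte
	pid int
}

// Pid reads a field through the receiver; if the receiver is nil,
// the load faults at address 0x78 (the field's offset from nil).
func (p *proc) Pid() int {
	return p.pid
}

func main() {
	var p *proc // nil, e.g. a state object holding no process
	_ = p.Pid()
	// panic: runtime error: invalid memory address or nil pointer dereference
	// [signal SIGSEGV: segmentation violation code=0x1 addr=0x78 pc=...]
}
```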
Random-Liu (Member) commented Feb 1, 2019

This looks like a race condition introduced in #2826.

I don't understand why we can simply remove the lock for the stopped state.

Based on the PR description, I think we can use a finer-grained lock for pid instead of removing the lock outright. Removing the lock introduces a race condition: the execState itself can be updated by a state transition, but that access is not protected in Pid().

I'm marking this p0, because it means that exec'ing into a container multiple times may panic the containerd-shim... which sounds really, really bad to me. Consider that users commonly use exec for liveness probes; a liveness probe could then kill the containerd-shim and eventually kill the container itself, if I remember the cleanupAfterDeadShim logic correctly...
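
To make that concrete, here is a minimal, runnable sketch of the pattern being described — the type and method names are simplified stand-ins, not containerd's actual code. Pid() reads the shared state field with no synchronization while a concurrent transition rewrites it; running this with `go run -race` reports the data race:

```go
package main

import (
	"fmt"
	"sync"
)

// state stands in for containerd's execState interface.
type state interface {
	Pid() int
}

type runningState struct{ pid int }

func (s *runningState) Pid() int { return s.pid }

type stoppedState struct{ pid int }

func (s *stoppedState) Pid() int { return s.pid }

// execProcess mirrors the shape of the real struct: a mutex plus a
// state field that transitions overwrite.
type execProcess struct {
	mu    sync.Mutex
	state state
}

// Pid reads the state field WITHOUT taking mu — the unprotected
// read described above.
func (p *execProcess) Pid() int {
	return p.state.Pid()
}

// transition swaps the state under the lock, e.g. running -> stopped.
func (p *execProcess) transition(s state) {
	p.mu.Lock()
	p.state = s
	p.mu.Unlock()
}

func main() {
	p := &execProcess{state: &runningState{pid: 1234}}

	var wg sync.WaitGroup
	wg.Add(2)
	go func() { // exit handler moving the exec process to stopped
		defer wg.Done()
		p.transition(&stoppedState{pid: 1234})
	}()
	go func() { // concurrent Pid() call, as in checkProcesses
		defer wg.Done()
		fmt.Println("pid:", p.Pid())
	}()
	wg.Wait()
}
```

The finer-grained alternative suggested above would guard just the read of the shared state rather than serializing every operation behind one big lock; a sketch of that appears further down in this thread.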

mman (Author) commented Feb 1, 2019

You are correct: my crashing containers all use health check probes, and the health check log message is the last one I see before the shim is reaped, so your multi-exec race theory seems valid.

mman (Author) commented Feb 11, 2019

Thanks for your great work, guys. Is there anything I can do to help speed up the official push of release 1.2.3 to Docker?

tom0392 commented Nov 21, 2023

> This looks like a race condition introduced in #2826.
>
> I don't understand why we can simply remove the lock for the stopped state.
>
> Based on the PR description, I think we can use a finer-grained lock for pid instead of removing the lock outright. Removing the lock introduces a race condition: the execState itself can be updated by a state transition, but that access is not protected in Pid().
>
> I'm marking this p0, because it means that exec'ing into a container multiple times may panic the containerd-shim... which sounds really, really bad to me. Consider that users commonly use exec for liveness probes; a liveness probe could then kill the containerd-shim and eventually kill the container itself, if I remember the cleanupAfterDeadShim logic correctly...

Here are our two questions about this bug:

1. Since it happens only occasionally, is there a way to reproduce the problem quickly?
2. What do you mean by "the execState itself can be updated by a state transition, but that access is not protected in Pid()"? I have thought about it several times but still don't fully understand it.

@Random-Liu Looking forward to your reply. Thanks.
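
To unpack that sentence: the exec process is a small state machine inside the shim. When the exec'd process exits, its state object is swapped (for example from a running state to a stopped state) while holding the process lock; after #2826, Pid() read that same state pointer without taking the lock, so a Pid() call racing with an exit could observe the state mid-update and, per the stack trace at the top of this issue, dereference a nil pointer. Reusing the hypothetical execProcess type from the race sketch above, a protected read would look like this — a sketch of the idea, not the actual fix in #2970:

```go
// Pid copies the shared state pointer while holding the lock, then
// calls Pid() on the local copy. Only the pointer read is serialized,
// which is the "finer grained" locking idea from the earlier comment.
// This replaces the unprotected Pid() in the sketch above.
func (p *execProcess) Pid() int {
	p.mu.Lock()
	s := p.state
	p.mu.Unlock()
	return s.Pid()
}
```

As for question 1, races like this are usually surfaced by stress rather than by a deterministic recipe: run many concurrent short-lived execs against one container (which is exactly what a frequent health check does), ideally with the shim built with the race detector enabled.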
