
Duplicate task in each worker. #3270


Closed
anan-lee opened this issue Jun 23, 2016 · 47 comments

@anan-lee

anan-lee commented Jun 23, 2016

celery: 3.1.23 (Cipater)
rabbitmq-server:3.2.4-1

I found that when a task is sent to RabbitMQ, the Celery workers sometimes receive a duplicate task.

function:
def do_backup(name):
....

for example:
server: do_backup.delay("/tmp/__36")

worker02:
[2016-06-23 20:38:27,182: INFO/MainProcess] Received task: tasks.do_backup[40159186-1b88-4d98-aabf-839bb4b7a852]
[2016-06-23 20:38:27,183: INFO/MainProcess] tasks.do_backup[40159186-1b88-4d98-aabf-839bb4b7a852]: get file name: /tmp/__36
worker12:
[2016-06-23 20:36:21,647: INFO/MainProcess] Received task: tasks.do_backup[d265ffba-fa21-48b6-baad-2c0185a3168f]
........
[2016-06-23 20:37:01,492: INFO/MainProcess] tasks.do_backup[d265ffba-fa21-48b6-baad-2c0185a3168f]: get file name: /tmp/__36
[2016-06-23 20:37:01,554: INFO/MainProcess] tasks.do_backup[d265ffba-fa21-48b6-baad-2c0185a3168f]: do task

I made sure the server only sent "/tmp/__36" once.

@ask
Contributor

ask commented Jun 28, 2016

They don't seem to be duplicates as they have different task ids?

@anan-lee
Author

anan-lee commented Jul 5, 2016

Thanks for your reply. The fact is that the same data (/tmp/__36) was delivered to two workers. I think one task didn't send its ack in time, so RabbitMQ delivered the task to another worker?

@DurandA

DurandA commented Sep 25, 2016

I have the same issue. I can see in the beat log that the task is only sent once, but it is received twice by the worker a few seconds apart. When using CELERY_ENABLE_UTC=True, the worker receives two tasks, one at UTC time and the other at local time.

I'm using Celery 3.1.23 (Cipater) with Redis 3.2.3.
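
For anyone hitting the UTC/local-time variant of this, it is worth double-checking that producers and workers agree on the timezone settings; a minimal Celery 3.1-style sketch (setting names are the documented ones, values are examples):

# Celery 3.1-style settings (Celery 4.x uses the lowercase enable_utc / timezone).
CELERY_ENABLE_UTC = True   # schedule and interpret ETAs in UTC everywhere
CELERY_TIMEZONE = 'UTC'    # keep producers and workers on the same timezone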

@cesterlizi

cesterlizi commented Oct 7, 2016

Same issue, but the task id is the same @ask @DurandA. Celery 3.1.23, RabbitMQ 3.6.5.
I've also added some debug logging to verify that the task is acked -> https://github.com/cesterlizi/celery/blob/3daf891f7db10f3f4c6cd89da3c39cb637400b1f/celery/worker/job.py#L532-L534
Late acks are enabled.
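For context, that refers to the Celery 3.x setting below (a one-line sketch; the Celery 4.x name is task_acks_late):

# Celery 3.x-style configuration referenced above
CELERY_ACKS_LATE = True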

[2016-10-07 12:08:25,875: INFO/MainProcess] Received task: recrepli.tasks.replicate[7c508306-772f-464e-bb7c-103bc1dca784]
[2016-10-07 12:08:25,876: DEBUG/MainProcess] TaskPool: Apply <function _fast_trace_task at 0x7fa3f9104e60> (args:(u'recrepli.tasks.replicate', u'7c508306-772f-464e-bb7c-103bc1dca784', [u'frogger', u'/home/gc/recordings_tmp/919/call_98277604.wav'], {}, {u'task': u'recrepli.tasks.replicate', u'group': None, u'is_eager': False, u'delivery_info': {u'priority': None, u'redelivered': False, u'routing_key': u'replicate', u'exchange': u'celery'}, u'args': [u'frogger', u'/home/gc/recordings_tmp/919/call_98277604.wav'], u'headers': {}, u'correlation_id': None, u'hostname': 'replicate_worker@toro', u'kwargs': {}, u'reply_to': None, u'id': u'7c508306-772f-464e-bb7c-103bc1dca784'}) kwargs:{})
[2016-10-07 12:08:25,876: DEBUG/MainProcess] on_accepted() for task recrepli.tasks.replicate[7c508306-772f-464e-bb7c-103bc1dca784]
[2016-10-07 12:08:25,876: INFO/Worker-8] recrepli.tasks.replicate[7c508306-772f-464e-bb7c-103bc1dca784]: Intentando replicar 0/None /home/gc/recordings_tmp/919/call_98277604.wav from AS frogger
[2016-10-07 12:08:25,876: DEBUG/MainProcess] Task accepted: recrepli.tasks.replicate[7c508306-772f-464e-bb7c-103bc1dca784] pid:26289
[2016-10-07 12:08:25,876: DEBUG/Worker-8] recrepli.tasks.replicate[7c508306-772f-464e-bb7c-103bc1dca784]: Colectando /home/gc/recordings_tmp/919/call_98277604.wav from as frogger
......

[2016-10-07 12:08:26,587: DEBUG/MainProcess] on_sucess() for task recrepli.tasks.replicate[7c508306-772f-464e-bb7c-103bc1dca784] return '0'
[2016-10-07 12:08:26,587: DEBUG/MainProcess] acknowledge(): for task recrepli.tasks.replicate[7c508306-772f-464e-bb7c-103bc1dca784]
[2016-10-07 12:08:26,587: INFO/MainProcess] Task recrepli.tasks.replicate[7c508306-772f-464e-bb7c-103bc1dca784] succeeded in 0.710850441828s: 0

[2016-10-07 12:08:27,504: INFO/MainProcess] Received task: recrepli.tasks.replicate[7c508306-772f-464e-bb7c-103bc1dca784]
[2016-10-07 12:08:32,776: DEBUG/MainProcess] TaskPool: Apply <function _fast_trace_task at 0x7fa3f9104e60> (args:(u'recrepli.tasks.replicate', u'7c508306-772f-464e-bb7c-103bc1dca784', [u'frogger', u'/home/gc/recordings_tmp/919/call_98277604.wav'], {}, {u'task': u'recrepli.tasks.replicate', u'group': None, u'is_eager': False, u'delivery_info': {u'priority': None, u'redelivered': False, u'routing_key': u'replicate', u'exchange': u'celery'}, u'args': [u'frogger', u'/home/gc/recordings_tmp/919/call_98277604.wav'], u'headers': {}, u'correlation_id': None, u'hostname': 'replicate_worker@toro', u'kwargs': {}, u'reply_to': None, u'id': u'7c508306-772f-464e-bb7c-103bc1dca784'}) kwargs:{})
[2016-10-07 12:08:40,273: INFO/Worker-5] recrepli.tasks.replicate[7c508306-772f-464e-bb7c-103bc1dca784]: Intentando replicar 0/None /home/gc/recordings_tmp/919/call_98277604.wav from AS frogger
[2016-10-07 12:08:40,273: DEBUG/Worker-5] recrepli.tasks.replicate[7c508306-772f-464e-bb7c-103bc1dca784]: Colectando /home/gc/recordings_tmp/919/call_98277604.wav from as frogger
[2016-10-07 12:08:40,273: DEBUG/MainProcess] on_accepted() for task recrepli.tasks.replicate[7c508306-772f-464e-bb7c-103bc1dca784]
[2016-10-07 12:08:40,273: DEBUG/MainProcess] Task accepted: recrepli.tasks.replicate[7c508306-772f-464e-bb7c-103bc1dca784] pid:26286
....

[2016-10-07 12:08:40,286: DEBUG/MainProcess] on_sucess() for task recrepli.tasks.replicate[7c508306-772f-464e-bb7c-103bc1dca784] return '1'
[2016-10-07 12:08:40,286: DEBUG/MainProcess] acknowledge(): for task recrepli.tasks.replicate[7c508306-772f-464e-bb7c-103bc1dca784]
[2016-10-07 12:08:40,286: INFO/MainProcess] Task recrepli.tasks.replicate[7c508306-772f-464e-bb7c-103bc1dca784] succeeded in 0.0134955914691s: 1

@cesterlizi

@ask I'm also attaching a log where you can see that the task is duplicated and executed in parallel.

[2016-10-07 12:13:25,966: INFO/MainProcess] Received task: recrepli.tasks.replicate[8935556a-f6c5-4e0e-8709-d5958e437e08]
[2016-10-07 12:13:25,966: DEBUG/MainProcess] TaskPool: Apply <function _fast_trace_task at 0x7fa3f9104e60> (args:(u'recrepli.tasks.replicate', u'8935556a-f6c5-4e0e-8709-d5958e437e08', [u'pacman', u'/home/gc/recordings_tmp/239/call_98278775.wav'], {}, {u'task': u'recrepli.tasks.replicate', u'group': None, u'is_eager': False, u'delivery_info': {u'priority': None, u'redelivered': False, u'routing_key': u'replicate', u'exchange': u'celery'}, u'args': [u'pacman', u'/home/gc/recordings_tmp/239/call_98278775.wav'], u'headers': {}, u'correlation_id': None, u'hostname': 'replicate_worker@toro', u'kwargs': {}, u'reply_to': None, u'id': u'8935556a-f6c5-4e0e-8709-d5958e437e08'}) kwargs:{})
[2016-10-07 12:13:25,967: DEBUG/MainProcess] on_accepted() for task recrepli.tasks.replicate[8935556a-f6c5-4e0e-8709-d5958e437e08]
[2016-10-07 12:13:25,967: INFO/Worker-6] recrepli.tasks.replicate[8935556a-f6c5-4e0e-8709-d5958e437e08]: Intentando replicar 0/None /home/gc/recordings_tmp/239/call_98278775.wav from AS pacman
[2016-10-07 12:13:25,967: DEBUG/MainProcess] Task accepted: recrepli.tasks.replicate[8935556a-f6c5-4e0e-8709-d5958e437e08] pid:26287
[2016-10-07 12:13:25,967: DEBUG/Worker-6] recrepli.tasks.replicate[8935556a-f6c5-4e0e-8709-d5958e437e08]: Colectando /home/gc/recordings_tmp/239/call_98278775.wav from as pacman
[2016-10-07 12:13:26,481: DEBUG/Worker-6] recrepli.tasks.replicate[8935556a-f6c5-4e0e-8709-d5958e437e08]: Colectado /home/gc/recordings_collector/call_239_98278775.wav
[2016-10-07 12:13:26,482: DEBUG/Worker-6] recrepli.tasks.replicate[8935556a-f6c5-4e0e-8709-d5958e437e08]: Convirtiendo a ogg /home/gc/recordings_collector/call_239_98278775.wav
[2016-10-07 12:13:26,754: DEBUG/Worker-6] recrepli.tasks.replicate[8935556a-f6c5-4e0e-8709-d5958e437e08]: Ogg file /home/gc/recordings_collector/call_239_98278775.ogg
[2016-10-07 12:13:26,754: DEBUG/Worker-6] recrepli.tasks.replicate[8935556a-f6c5-4e0e-8709-d5958e437e08]: Intentando replicar a los hosts
[2016-10-07 12:13:27,464: INFO/MainProcess] Received task: recrepli.tasks.replicate[8935556a-f6c5-4e0e-8709-d5958e437e08]
[2016-10-07 12:13:27,464: DEBUG/MainProcess] TaskPool: Apply <function _fast_trace_task at 0x7fa3f9104e60> (args:(u'recrepli.tasks.replicate', u'8935556a-f6c5-4e0e-8709-d5958e437e08', [u'pacman', u'/home/gc/recordings_tmp/239/call_98278775.wav'], {}, {u'task': u'recrepli.tasks.replicate', u'group': None, u'is_eager': False, u'delivery_info': {u'priority': None, u'redelivered': False, u'routing_key': u'replicate', u'exchange': u'celery'}, u'args': [u'pacman', u'/home/gc/recordings_tmp/239/call_98278775.wav'], u'headers': {}, u'correlation_id': None, u'hostname': 'replicate_worker@toro', u'kwargs': {}, u'reply_to': None, u'id': u'8935556a-f6c5-4e0e-8709-d5958e437e08'}) kwargs:{})
[2016-10-07 12:13:27,465: DEBUG/MainProcess] on_accepted() for task recrepli.tasks.replicate[8935556a-f6c5-4e0e-8709-d5958e437e08]
[2016-10-07 12:13:27,465: DEBUG/MainProcess] Task accepted: recrepli.tasks.replicate[8935556a-f6c5-4e0e-8709-d5958e437e08] pid:26291
[2016-10-07 12:13:27,465: INFO/Worker-10] recrepli.tasks.replicate[8935556a-f6c5-4e0e-8709-d5958e437e08]: Intentando replicar 0/None /home/gc/recordings_tmp/239/call_98278775.wav from AS pacman
[2016-10-07 12:13:27,465: DEBUG/Worker-10] recrepli.tasks.replicate[8935556a-f6c5-4e0e-8709-d5958e437e08]: Colectando /home/gc/recordings_tmp/239/call_98278775.wav from as pacman
[2016-10-07 12:13:27,980: DEBUG/Worker-10] recrepli.tasks.replicate[8935556a-f6c5-4e0e-8709-d5958e437e08]: Colectado /home/gc/recordings_collector/call_239_98278775.wav
[2016-10-07 12:13:27,980: DEBUG/Worker-10] recrepli.tasks.replicate[8935556a-f6c5-4e0e-8709-d5958e437e08]: Convirtiendo a ogg /home/gc/recordings_collector/call_239_98278775.wav
[2016-10-07 12:13:27,986: DEBUG/Worker-10] recrepli.tasks.replicate[8935556a-f6c5-4e0e-8709-d5958e437e08]: Ogg file /home/gc/recordings_collector/call_239_98278775.ogg
[2016-10-07 12:13:27,986: DEBUG/Worker-10] recrepli.tasks.replicate[8935556a-f6c5-4e0e-8709-d5958e437e08]: Intentando replicar a los hosts
[2016-10-07 12:13:28,627: DEBUG/Worker-6] recrepli.tasks.replicate[8935556a-f6c5-4e0e-8709-d5958e437e08]: Replicado /home/gc/recordings_collector/call_239_98278775.ogg to hosts ['toro', 'bermeja', 'marsupial', 'pepe']
[2016-10-07 12:13:28,714: INFO/Worker-6] recrepli.tasks.replicate[8935556a-f6c5-4e0e-8709-d5958e437e08]: Indexado: /home/gc/recordings_collector/call_239_98278775.ogg : {'timestamp': 1475853206, '_rev': u'1-ff292a5338b1a81855f44f1774b10010', 'storage': ['toro', 'bermeja', 'marsupial', 'pepe'], 'doctype': 'recording', 'filename': 'call_98278775.ogg', 'channels': 1, 'audioformat': 'ogg', 'duration': 193.44, '_id': u'98278775', 'tenantid': 239, 'size': 558034}
[2016-10-07 12:13:28,715: DEBUG/Worker-6] recrepli.tasks.replicate[8935556a-f6c5-4e0e-8709-d5958e437e08]: Enviando a replicar a google: /home/gc/recordings/239/call_98278775.ogg
[2016-10-07 12:13:28,723: DEBUG/MainProcess] on_sucess() for task recrepli.tasks.replicate[8935556a-f6c5-4e0e-8709-d5958e437e08] return '0'
[2016-10-07 12:13:28,723: DEBUG/MainProcess] acknowledge(): for task recrepli.tasks.replicate[8935556a-f6c5-4e0e-8709-d5958e437e08]
[2016-10-07 12:13:28,724: INFO/MainProcess] Task recrepli.tasks.replicate[8935556a-f6c5-4e0e-8709-d5958e437e08] succeeded in 2.75686794892s: 0
[2016-10-07 12:13:28,910: ERROR/Worker-10] recrepli.tasks.replicate[8935556a-f6c5-4e0e-8709-d5958e437e08]: [Error 127] 'WebDav: LocalResourceNotFound()'
[2016-10-07 12:13:28,910: WARNING/Worker-10] recrepli.tasks.replicate[8935556a-f6c5-4e0e-8709-d5958e437e08]: No se pudo replicar: /home/gc/recordings_tmp/239/call_98278775.wav, reintentar
[2016-10-07 12:13:28,912: DEBUG/MainProcess] on_sucess() for task recrepli.tasks.replicate[8935556a-f6c5-4e0e-8709-d5958e437e08] return <ExceptionInfo: Retry(u'No se pudo replicar', 120)>
[2016-10-07 12:13:28,912: WARNING/MainProcess] on_success() for task recrepli.tasks.replicate[8935556a-f6c5-4e0e-8709-d5958e437e08] with ExceptionInfo: <ExceptionInfo: Retry(u'No se pudo replicar', 120)>
[2016-10-07 12:13:28,912: WARNING/MainProcess] on_failure(): for task recrepli.tasks.replicate[8935556a-f6c5-4e0e-8709-d5958e437e08] ExceptionInfo: <ExceptionInfo: Retry(u'No se pudo replicar', 120)>
[2016-10-07 12:13:28,912: WARNING/MainProcess] on_retry(): for task recrepli.tasks.replicate[8935556a-f6c5-4e0e-8709-d5958e437e08] ExceptionInfo: <ExceptionInfo: Retry(u'No se pudo replicar', 120)>
[2016-10-07 12:13:28,912: DEBUG/MainProcess] acknowledge(): for task recrepli.tasks.replicate[8935556a-f6c5-4e0e-8709-d5958e437e08]

@cesterlizi

Hi,
I found that the duplication is generated when I publish the task. In this case I'm sending the task to RabbitMQ using the Java RabbitMQ API with publisher confirmation. For some reason (not yet understood) the confirmation fails and I send the task again, so two task messages are enqueued in RabbitMQ and two tasks are consumed and executed by the Celery worker.

@ask, please dismiss my earlier assertion in this issue.

Carlos

@Skyross

Skyross commented Nov 24, 2016

Hi all

We've found the same issue in several of our projects when using Redis as the broker, but it doesn't seem to be a broker-specific problem. From the logs it really looks like, for any delayed task (one using ETA, for example via the countdown argument), several workers pull the same task (with the same task ID). I found a lot of posts on the Internet describing the same situation in different words (post, issue), and there is a repository specifically for that case. Maybe that lock should be implemented at the Celery level for any broker?

We use the snippet below to prevent such task behavior:

import datetime

from celery import Task
from celery.utils.log import get_task_logger
from django.conf import settings
from django.core.cache import get_cache

logger = get_task_logger(__name__)


# noinspection PyAbstractClass
class TaskWithLock(Task):
    """
    Base task with lock to prevent multiple execution of tasks with ETA.
    It's happens with multiple workers for tasks with any delay (countdown, ETA).
    You may override cache backend by setting `CELERY_TASK_LOCK_CACHE` in your Django settings file
    """
    abstract = True
    lock_expiration = 60 * 60 * 24  # 1 day

    cache = get_cache(getattr(settings, 'CELERY_TASK_LOCK_CACHE', 'default'))

    @property
    def lock_key(self):
        """Unique string for task as lock key"""
        return '%s_%s' % (self.__class__.__name__, self.request.id)

    def acquire_lock(self):
        """Set lock"""
        result = self.cache.add(self.lock_key, True, self.lock_expiration)
        logger.debug('Lock for %s at %s %s', self.request.id, datetime.datetime.now(), 'succeed' if result else 'failed')
        return result

    def release_lock(self):
        """Release lock"""
        result = self.cache.delete(self.lock_key)
        logger.debug('Lock release for %s at %s %s', self.request.id, datetime.datetime.now(), 'succeed' if result else 'failed')
        return result

    def retry(self, *args, **kwargs):
        """We need to release our lock to let the first process take current task in execution on retry"""
        logger.debug('Retry requested %s, %s', args, kwargs)
        self.release_lock()
        return super(TaskWithLock, self).retry(*args, **kwargs)

    def __call__(self, *args, **kwargs):
        """Checking for lock existence"""
        if self.acquire_lock():
            logger.debug('Task %s execution with lock started', self.request.id)
            return super(TaskWithLock, self).__call__(*args, **kwargs)
        logger.debug('Task %s skipped due lock detection', self.request.id)

Feel free to test and point to any errors in this snippet.
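
For reference, a task opts into this base class through the base argument of the task decorator (a minimal sketch, assuming a Celery app instance named app; send_report is a hypothetical task):

@app.task(base=TaskWithLock)
def send_report(report_id):
    # runs only if TaskWithLock.__call__ acquired the lock for this request id
    ...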

@andynguyen11

@Skyross I've been investigating a similar issue this morning, as we are using Redis as our broker too. How long are your ETAs set to? We have multiple tasks with ETAs set anywhere between 1 and 3 days. I suspect our issue (and maybe yours as well) could be this:

http://docs.celeryproject.org/en/latest/getting-started/brokers/redis.html#redis-visibility-timeout
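
For reference, the workaround described there is to raise the transport's visibility timeout above the longest ETA/countdown you schedule; a Celery 3.x-style sketch (in 4.x the setting is broker_transport_options, and the value here is only an example):

# Redis transport: a message left unacked longer than this gets redelivered,
# so the timeout must exceed the longest ETA/countdown in use (e.g. 4 days).
BROKER_TRANSPORT_OPTIONS = {'visibility_timeout': 60 * 60 * 24 * 4}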

@topalhan

@andynguyen11, I am not familiar with Redis but duplicate tasks happen on other brokers for various reasons (see #2976) and this solution would help.

@Skyross - Thanks for this. I see your solution uses the Django cache, which works for me, but I am wondering whether we can find a more generic solution. How about checking the list of existing scheduled tasks, which can be retrieved with celery.control.inspect().scheduled()?

Skip if the task is already scheduled, e.g.:

scheduled = celery.control.inspect().scheduled()
if scheduled and self.request.id in [item['request']['id']
                                     for sublist in scheduled.values() for item in sublist]:
    return  # already scheduled elsewhere; skip

Also, are you using this for workflows or single tasks?

@Skyross

Skyross commented Nov 30, 2016

Hi all again,

The previous version of my code contains at least one error: all tasks that retry will be skipped too. I'm testing a new version of the class. As you can see, we now don't release the lock but simply wait for its expiration.

I also created a public repository for that sandbox purpose (I may move it to GitHub later if needed).

@andynguyen11 Yes, you were right! I came to the same conclusion. Still, sometimes we need to schedule a delay further in the future than the visibility_timeout value.

@topalhan Feel free to provide your version and compare performance 👍

UPD: moved repo to https://gitlab.com/the-past/sandbox/Django-1.8.13-Celery-3.1.23

@auvipy
Member

auvipy commented Jan 15, 2018

What's the situation after 4.1/master?

@Alexeyisme

Alexeyisme commented Jan 16, 2018

Hi! I also encountered the duplication problem.
We have ETA tasks scheduled 1-4 hours ahead. In the attached log you can see that the task with id 06eaa034-0e0c-49e8-88ec-08279c871cb1 is applied twice and executed twice (the client device received two identical push notifications).
I've already tried a huge visibility_timeout.

2018-01-15T14:17:01.102524+00:00 app[worker.1]: [2018-01-15 14:17:01,102: DEBUG/MainProcess] TaskPool: Apply <function _fast_trace_task at 0x7f05a9003598> (args:('notifications.event_update_tasks.send_event_update_notification', '06eaa034-0e0c-49e8-88ec-08279c871cb1', {'lang': 'py', 'task': 'notifications.event_update_tasks.send_event_update_notification', 'id': '06eaa034-0e0c-49e8-88ec-08279c871cb1', 'eta': '2018-01-15T14:17:00.010508+00:00', 'expires': None, 'group': None, 'retries': 0, 'timelimit': [None, None], 'root_id': '9dd82f0e-9af3-4714-9ac1-f96b994b5d98', 'parent_id': '9dd82f0e-9af3-4714-9ac1-f96b994b5d98', 'argsrepr': '(15446, 163)', 'kwargsrepr': '{}', 'origin': 'gen11@e8457c44-4b6c-4d27-b317-9c6b7cc069d0', 'redelivered': True, 'reply_to': '6d7d0c3d-b5d2-3141-850f-6bf23ac825af', 'correlation_id': '06eaa034-0e0c-49e8-88ec-08279c871cb1', 'delivery_info': {'exchange': '', 'routing_key': 'celery', 'priority': 0, 'redelivered': None}}, b'[[15446, 163], {}, {"callbacks": null, "errbacks": null, "chain": null, "chord": null}]', 'application/json', 'utf-8') kwargs:{})
2018-01-15T14:17:01.102886+00:00 app[worker.1]: [2018-01-15 14:17:01,102: DEBUG/MainProcess] TaskPool: Apply <function _fast_trace_task at 0x7f05a9003598> (args:('notifications.event_update_tasks.send_event_update_notification', '06eaa034-0e0c-49e8-88ec-08279c871cb1', {'lang': 'py', 'task': 'notifications.event_update_tasks.send_event_update_notification', 'id': '06eaa034-0e0c-49e8-88ec-08279c871cb1', 'eta': '2018-01-15T14:17:00.010508+00:00', 'expires': None, 'group': None, 'retries': 0, 'timelimit': [None, None], 'root_id': '9dd82f0e-9af3-4714-9ac1-f96b994b5d98', 'parent_id': '9dd82f0e-9af3-4714-9ac1-f96b994b5d98', 'argsrepr': '(15446, 163)', 'kwargsrepr': '{}', 'origin': 'gen11@e8457c44-4b6c-4d27-b317-9c6b7cc069d0', 'reply_to': '6d7d0c3d-b5d2-3141-850f-6bf23ac825af', 'correlation_id': '06eaa034-0e0c-49e8-88ec-08279c871cb1', 'delivery_info': {'exchange': '', 'routing_key': 'celery', 'priority': 0, 'redelivered': None}}, b'[[15446, 163], {}, {"callbacks": null, "errbacks": null, "chain": null, "chord": null}]', 'application/json', 'utf-8') kwargs:{})
2018-01-15T14:17:01.103941+00:00 app[worker.1]: [2018-01-15 14:17:01,103: DEBUG/ForkPoolWorker-1] send_event_update_notification
2018-01-15T14:17:01.104265+00:00 app[worker.1]: [2018-01-15 14:17:01,104: DEBUG/ForkPoolWorker-3] send_event_update_notification
2018-01-15T14:17:01.104976+00:00 app[worker.1]: [2018-01-15 14:17:01,104: DEBUG/MainProcess] Task accepted: notifications.event_update_tasks.send_event_update_notification[255ce589-f8f7-4b7e-9d4f-aba88aa91038] pid:10
2018-01-15T14:17:01.105307+00:00 app[worker.1]: [2018-01-15 14:17:01,105: DEBUG/MainProcess] Task accepted: notifications.event_update_tasks.send_event_update_notification[06eaa034-0e0c-49e8-88ec-08279c871cb1] pid:12
2018-01-15T14:17:01.105661+00:00 app[worker.1]: [2018-01-15 14:17:01,105: DEBUG/MainProcess] Task accepted: notifications.event_update_tasks.send_event_update_notification[06eaa034-0e0c-49e8-88ec-08279c871cb1] pid:13
2018-01-15T14:17:01.106536+00:00 app[worker.1]: [2018-01-15 14:17:01,104: DEBUG/ForkPoolWorker-4] send_event_update_notification

@auvipy auvipy added this to the v5.0.0 milestone Jan 16, 2018
@martin-xia

martin-xia commented Mar 5, 2018

Having the same problem here: the same task_id is sent to multiple workers within 10 seconds. Is there any existing solution?

Moreover, in my situation the task has no countdown.

@tagashy

tagashy commented Mar 6, 2018

I'm having the same issue with duplicate tasks on Redis.
I solved it with a decorator (but it's more of a hack than a proper fix), based on https://stackoverflow.com/questions/26831103/avoiding-duplicate-tasks-in-celery-broker

from redis import Redis
from functools import wraps

def run_only_one_instance(lock_name):
    def real_decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            print("run only once " + lock_name + str(args) + str(kwargs))
            try:
                sentinel = Redis(host="redis").incr(lock_name + str(args) + str(kwargs))
                if sentinel == 1:
                    # I am the legitimate running task
                    print("running func")
                    return func(*args, **kwargs)
                else:
                    # Do you want to do something else on task duplicate?
                    print("run is already done escaping")
                Redis(host="redis").decr(lock_name + str(args) + str(kwargs))
            except Exception as e:
                Redis(host="redis").decr(lock_name + str(args) + str(kwargs))
                # potentially log error with Sentry?
                # decrement the counter to ensure tasks can run
                # or: raise e

        return wrapper

    return real_decorator

Hope it will help some people.
In case someone needs help using it, this is how I use it:

@app.task
@run_only_one_instance("name_" + "get_all_issues")
def get_all_issues():
    ...

EDIT: I forgot the return in the wrapper part; it's now corrected.

@auvipy
Member

auvipy commented Mar 6, 2018

Can this be added to the docs? Or is there any patch upstream?

@tagashy

tagashy commented Mar 6, 2018

If you are speaking about my code, of course it can be added to the docs or sent as a patch.
But wait till tomorrow: I will make a better version without hardcoded stuff, and I may have found a way to remove the lock_name variable and replace it with data derived from the function.
Will do it ASAP :)

@tagashy

tagashy commented Mar 7, 2018

So here is a version that is not hardcoded (but it still works only with Redis):
the lock name is based on module name + function name + args + kwargs (the module name covers the case where you have two functions with the same name in different modules that need to use the same Redis);
the Redis server address is the only parameter and defaults to 'redis';
I've also added some exception logging using error from the standard logging lib;
and finally I've removed the print calls.

from redis import Redis
from functools import wraps
from logging import error


def run_only_one_instance(redis_addr: str = "redis"):
    def real_decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            # Lock key: module name + function name + args + kwargs
            lock_name = func.__module__ + func.__name__ + str(args) + str(kwargs)
            try:
                sentinel = Redis(host=redis_addr).incr(lock_name)
                if sentinel == 1:
                    # First invocation for these arguments: run the task
                    return func(*args, **kwargs)
                # Duplicate invocation: skip the task and undo our increment
                Redis(host=redis_addr).decr(lock_name)
            except Exception as e:
                # Decrement the counter so the task can run again, then log
                Redis(host=redis_addr).decr(lock_name)
                error(e)

        return wrapper

    return real_decorator

I forgot to mention that it only works on Python 3 (if you want Python 2 support, remove the ': str' annotation from the definition of run_only_one_instance).
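
Usage is the same as before, just with the Redis address as the decorator argument (a minimal sketch, reusing the get_all_issues example from above and assuming a Celery app instance named app):

@app.task
@run_only_one_instance(redis_addr="redis")
def get_all_issues():
    ...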

@auvipy
Member

auvipy commented Mar 7, 2018

would you mind sending a PR on master?

@tagashy

tagashy commented Mar 7, 2018

Done, but it's the first time I've made a PR.
I hope I have done it correctly.
The file is inside celery/utils/.

@auvipy auvipy modified the milestones: v5.0.0, v4.3 May 27, 2018
@auvipy auvipy modified the milestones: 4.6, 4.5 Jul 16, 2019
@auvipy auvipy modified the milestones: 4.5, 4.4.x Dec 16, 2019
@auvipy
Member

auvipy commented Dec 16, 2019

May I know the status of this with celery==4.4.0?

@calvin620707

calvin620707 commented Feb 29, 2020

I still observe this issue with Celery 4.4.0. I will check whether I can reproduce it with a test case or some simpler reproduction steps.

Everything is fine with

-celery = {extras = ["redis"],version = "==4.2.1"}
-kombu = '==4.2.2.post1'

We noticed this issue after doing the upgrade:

+celery = {extras = ["redis"],version = "==4.3.0"}
+kombu = '==4.4.0'

and the issue persists with

celery = {extras = ["redis"],version = "==4.4.0"}
"kombu": {
    "hashes": [
        "sha256:2a9e7adff14d046c9996752b2c48b6d9185d0b992106d5160e1a179907a5d4ac",
        "sha256:67b32ccb6fea030f8799f8fd50dd08e03a4b99464ebc4952d71d8747b1a52ad1"
    ],
    "version": "==4.6.7"
},

Update on 3/3:
The issue seems to be gone after downgrading to 'celery[redis]==4.2.1' 'kombu==4.2.2.post1'.

Update on 3/6:
Found a lot of entries in Redis' unacked key. Here are some of them.

 '12fb3ab3-d784-42eb-b8ce-f2a4dedf86db': '[{"body": "gAIoSmkS5ABN+wdVMEZOM3kxQmdHT1phd3d4YVlxcF92U0YwNVJ5YWdmSDFHM0tXZkVZMDVEV18xS0NBcHEBfXECKFUHYWNjb3VudEppEuQAVQxzdWJzY3JpcHRpb25Kj+sDAHV0cQN9fXEEKFgFAAAAY2hvcmRxBU5YCQAAAGNhbGxiYWNrc3EGTlgIAAAAZXJyYmFja3NxB05YBQAAAGNoYWlucQhOdYdxCS4=", "headers": {"origin": "gen23299@test-celery-worker-1", "lang": "py", "task": "foo.celery.tasks.credentials.requests.notify_webhook", "group": null, "root_id": "0eb82b08-5318-40bb-92b6-a19095a79ab2", "expires": null, "retries": 11, "timelimit": [null, null], "argsrepr": "(14946921, 2043, \'FN3y1BgGOZawwxaYqp_vSF05RyagfH1G3KWfEY05DW_1KCAp\', {\'account\': 14946921, \'subscription\': 256911})", "eta": "2020-03-06T09:51:14.024135+00:00", "parent_id": "a875de24-b48f-459e-bae4-173e3c85b387", "id": "bae2803a-ec5c-4ca7-8ed2-2ce2b755265e", "redelivered": true, "shadow": null, "kwargsrepr": "{}"}, "content-type": "application/x-python-serialize", "properties": {"origin": "gen23298@test-celery-worker-1", "body_encoding": "base64", "delivery_info": {"routing_key": "celery-bg", "exchange": ""}, "delivery_mode": 1, "priority": 0, "correlation_id": "bae2803a-ec5c-4ca7-8ed2-2ce2b755265e", "reply_to": "cd5a7d96-b46d-3446-ae18-a3010a2e838a", "delivery_tag": "12fb3ab3-d784-42eb-b8ce-f2a4dedf86db"}, "content-encoding": "binary"}, "", "celery-bg"]',
 '07d5c057-40e8-4aa9-9a96-23f14a11307d': '[{"body": "gAIoSmkS5ABN+wdVMEZOM3kxQmdHT1phd3d4YVlxcF92U0YwNVJ5YWdmSDFHM0tXZkVZMDVEV18xS0NBcHEBfXECKFUHYWNjb3VudEppEuQAVQxzdWJzY3JpcHRpb25Kj+sDAHV0cQN9fXEEKFgFAAAAY2hvcmRxBU5YCQAAAGNhbGxiYWNrc3EGTlgIAAAAZXJyYmFja3NxB05YBQAAAGNoYWlucQhOdYdxCS4=", "headers": {"origin": "gen23299@test-celery-worker-1", "lang": "py", "task": "foo.celery.tasks.credentials.requests.notify_webhook", "group": null, "root_id": "0eb82b08-5318-40bb-92b6-a19095a79ab2", "expires": null, "retries": 11, "timelimit": [null, null], "argsrepr": "(14946921, 2043, \'FN3y1BgGOZawwxaYqp_vSF05RyagfH1G3KWfEY05DW_1KCAp\', {\'account\': 14946921, \'subscription\': 256911})", "eta": "2020-03-06T09:50:17.118106+00:00", "parent_id": "a875de24-b48f-459e-bae4-173e3c85b387", "id": "bae2803a-ec5c-4ca7-8ed2-2ce2b755265e", "redelivered": true, "shadow": null, "kwargsrepr": "{}"}, "content-type": "application/x-python-serialize", "properties": {"origin": "gen23298@test-celery-worker-1", "body_encoding": "base64", "delivery_info": {"routing_key": "celery-bg", "exchange": ""}, "delivery_mode": 1, "priority": 0, "correlation_id": "bae2803a-ec5c-4ca7-8ed2-2ce2b755265e", "reply_to": "cd5a7d96-b46d-3446-ae18-a3010a2e838a", "delivery_tag": "07d5c057-40e8-4aa9-9a96-23f14a11307d"}, "content-encoding": "binary"}, "", "celery-bg"]',
 '5c840cb9-6e20-4958-8f58-1a3502c3f8cb': '[{"body": "gAIoVSAxNTFlYWNkNDM5NDE0YjZjYjg3MGQyMGI3NzJjYTYyOHEBTfsHVQRmYWtlcQJ9cQNVA2Zvb1UaMjAyMC0wMy0wNlQwOTozMjozNi4xNTQ5OTlzdHEEfX1xBShYBQAAAGNob3JkcQZOWAkAAABjYWxsYmFja3NxB05YCAAAAGVycmJhY2tzcQhOWAUAAABjaGFpbnEJTnWHcQou", "headers": {"origin": "gen23299@test-celery-worker-1", "lang": "py", "task": "foo.celery.tasks.credentials.requests.notify_webhook", "group": null, "root_id": "4cd3108b-f75a-442f-9e3a-477576f65af0", "expires": null, "retries": 11, "timelimit": [null, null], "argsrepr": "(\'151eacd439414b6cb870d20b772ca628\', 2043, \'fake\', {\'foo\': \'2020-03-06T09:32:36.154999\'})", "eta": "2020-03-06T09:47:22.954949+00:00", "parent_id": "4cd3108b-f75a-442f-9e3a-477576f65af0", "id": "4cd3108b-f75a-442f-9e3a-477576f65af0", "redelivered": true, "shadow": null, "kwargsrepr": "{}"}, "content-type": "application/x-python-serialize", "properties": {"origin": "gen23299@test-celery-worker-1", "body_encoding": "base64", "delivery_info": {"routing_key": "celery-bg", "exchange": ""}, "delivery_mode": 1, "priority": 0, "correlation_id": "4cd3108b-f75a-442f-9e3a-477576f65af0", "reply_to": "4f0b394f-dc4e-37bc-9259-b0a1f3400d81", "delivery_tag": "5c840cb9-6e20-4958-8f58-1a3502c3f8cb"}, "content-encoding": "binary"}, "", "celery-bg"]',
 '9e9eb118-4799-4224-8640-607c0d4189fc': '[{"body": "gAIoVSAxNTFlYWNkNDM5NDE0YjZjYjg3MGQyMGI3NzJjYTYyOHEBTfsHVQRmYWtlcQJ9cQNVA2Zvb1UaMjAyMC0wMy0wNlQwOTozMjozNi4xNTQ5OTlzdHEEfX1xBShYBQAAAGNob3JkcQZOWAkAAABjYWxsYmFja3NxB05YCAAAAGVycmJhY2tzcQhOWAUAAABjaGFpbnEJTnWHcQou", "headers": {"origin": "gen23299@test-celery-worker-1", "lang": "py", "task": "foo.celery.tasks.credentials.requests.notify_webhook", "group": null, "root_id": "4cd3108b-f75a-442f-9e3a-477576f65af0", "expires": null, "retries": 12, "timelimit": [null, null], "argsrepr": "(\'151eacd439414b6cb870d20b772ca628\', 2043, \'fake\', {\'foo\': \'2020-03-06T09:32:36.154999\'})", "eta": "2020-03-06T09:56:23.530909+00:00", "parent_id": "4cd3108b-f75a-442f-9e3a-477576f65af0", "shadow": null, "id": "4cd3108b-f75a-442f-9e3a-477576f65af0", "kwargsrepr": "{}"}, "content-type": "application/x-python-serialize", "properties": {"origin": "gen23298@test-celery-worker-1", "body_encoding": "base64", "delivery_info": {"routing_key": "celery-bg", "exchange": ""}, "delivery_mode": 1, "priority": 0, "correlation_id": "4cd3108b-f75a-442f-9e3a-477576f65af0", "reply_to": "4f0b394f-dc4e-37bc-9259-b0a1f3400d81", "delivery_tag": "9e9eb118-4799-4224-8640-607c0d4189fc"}, "content-encoding": "binary"}, "", "celery-bg"]',
 '44a7f92c-5ad7-47e3-8af0-4067c3cbc446': '[{"body": "gAIoVSAxNTFlYWNkNDM5NDE0YjZjYjg3MGQyMGI3NzJjYTYyOHEBTfsHVQRmYWtlcQJ9cQNVA2Zvb1UaMjAyMC0wMy0wNlQwOTozMjozNi4xNTQ5OTlzdHEEfX1xBShYBQAAAGNob3JkcQZOWAkAAABjYWxsYmFja3NxB05YCAAAAGVycmJhY2tzcQhOWAUAAABjaGFpbnEJTnWHcQou", "headers": {"origin": "gen23298@test-celery-worker-1", "lang": "py", "task": "foo.celery.tasks.credentials.requests.notify_webhook", "group": null, "root_id": "4cd3108b-f75a-442f-9e3a-477576f65af0", "expires": null, "retries": 12, "timelimit": [null, null], "argsrepr": "(\'151eacd439414b6cb870d20b772ca628\', 2043, \'fake\', {\'foo\': \'2020-03-06T09:32:36.154999\'})", "eta": "2020-03-06T09:53:20.107464+00:00", "parent_id": "4cd3108b-f75a-442f-9e3a-477576f65af0", "id": "4cd3108b-f75a-442f-9e3a-477576f65af0", "shadow": null, "redelivered": true, "kwargsrepr": "{}"}, "content-type": "application/x-python-serialize", "properties": {"origin": "gen23299@test-celery-worker-1", "body_encoding": "base64", "delivery_info": {"routing_key": "celery-bg", "exchange": ""}, "delivery_mode": 1, "priority": 0, "correlation_id": "4cd3108b-f75a-442f-9e3a-477576f65af0", "reply_to": "4f0b394f-dc4e-37bc-9259-b0a1f3400d81", "delivery_tag": "44a7f92c-5ad7-47e3-8af0-4067c3cbc446"}, "content-encoding": "binary"}, "", "celery-bg"]',
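
For anyone checking their own broker, these entries can be listed with redis-py (a sketch only, assuming kombu's default key names unacked / unacked_index and the default broker database):

import redis

r = redis.Redis()  # point this at the broker database
print(r.hlen('unacked'))                                 # delivered-but-unacked messages
print(r.zrange('unacked_index', 0, 4, withscores=True))  # oldest delivery tags with timestamps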

It seems like the issue started after upgrading kombu from 4.2.2.post1 to 4.3.0.

I used git bisect to locate the first bad commit: a307726faa8b64a71e04932b762498f51c1d7208. This may be relevant because we use Redis + Celery with gevent, and the changes also look relevant.

git bisect start
# bad: [b4be5cf2e7e5ca538246e254bc6ebac5d6c5f477] Bump version: 4.2.2 → 4.3.0
git bisect bad b4be5cf2e7e5ca538246e254bc6ebac5d6c5f477
# good: [06ee7855814f66280b1b3277b0ac7ff4d8f3e76d] Mention 4.2.2-post1 in changelog.
git bisect good 06ee7855814f66280b1b3277b0ac7ff4d8f3e76d
# good: [c627c6d9d1f29ae0bcb03931d5f24001e60f72a9] Bump version: 4.2.0 → 4.2.1
git bisect good c627c6d9d1f29ae0bcb03931d5f24001e60f72a9
# bad: [c48d201ee2afe80cc6a30d9e58beb982d20fc220] Allow setting boto3.sqs.create_queue Attributes via transport_options (#957)
git bisect bad c48d201ee2afe80cc6a30d9e58beb982d20fc220
# bad: [62087a67cfbae4e87718fbb917f5140aa66e8287] Fix a deprecation warning about logger.warn() (#924)
git bisect bad 62087a67cfbae4e87718fbb917f5140aa66e8287
# good: [87d6423289caa776fc10a8d7cdb3d4b741460cfa] fix/Broadcast-fail-by-give-queue-kwargs (#906)
git bisect good 87d6423289caa776fc10a8d7cdb3d4b741460cfa
# bad: [200a60a228bd836542e9e9a2759af1d915139d1c] Add support for 'rediss' scheme for secure redis connections. (#910)
git bisect bad 200a60a228bd836542e9e9a2759af1d915139d1c
# bad: [9e64dafa617d4403edfa07e7ead015ce081046c0] Allow SimpleQueue to pass 'queue_arguments' to Queue object. (#900)
git bisect bad 9e64dafa617d4403edfa07e7ead015ce081046c0
# bad: [a307726faa8b64a71e04932b762498f51c1d7208] on worker restart - restore visible regardless to time (#905)
git bisect bad a307726faa8b64a71e04932b762498f51c1d7208
# first bad commit: [a307726faa8b64a71e04932b762498f51c1d7208] on worker restart - restore visible regardless to time (#905)

Alright. My case is already mentioned in celery/kombu#1098.

@noamkush
Contributor

noamkush commented Jun 8, 2020

@auvipy @thedrow
I encountered the bug and managed to reproduce it (Celery 4.4.5 + SQS) with a fairly simple test case:
Task code:

from celery import Celery

app = Celery('proj', broker='sqs://')


@app.task()
def run_once():
    print('Hi')

    return 1

The calling code:

In [2]: tasks.run_once.apply_async(countdown=1830)
Out[2]: <AsyncResult: 71153f8e-1fae-4cee-9923-4753ec0f2d45>

Celery log:

[2020-06-08 14:21:38,328: INFO/MainProcess] Received task: proj.tasks.run_once[71153f8e-1fae-4cee-9923-4753ec0f2d45]  ETA:[2020-06-08 11:52:06.057666+00:00]
[2020-06-08 14:51:40,114: INFO/MainProcess] Received task: proj.tasks.run_once[71153f8e-1fae-4cee-9923-4753ec0f2d45]  ETA:[2020-06-08 11:52:06.057666+00:00]
[2020-06-08 14:52:06,967: WARNING/ForkPoolWorker-2] Hi
[2020-06-08 14:52:06,968: WARNING/ForkPoolWorker-3] Hi
[2020-06-08 14:52:06,972: INFO/ForkPoolWorker-2] Task proj.tasks.run_once[71153f8e-1fae-4cee-9923-4753ec0f2d45] succeeded in 0.006056700000044657s: 1
[2020-06-08 14:52:06,972: INFO/ForkPoolWorker-3] Task proj.tasks.run_once[71153f8e-1fae-4cee-9923-4753ec0f2d45] succeeded in 0.005694999999832362s: 1

You need to run the celery worker before calling the task and you must leave it running until the task runs.

@Sprungwunder

@noamkush Interesting to see the timestamps; it looks like the task is received first and then maybe redelivered after 30 minutes, and then executed twice?

@auvipy auvipy removed their assignment Jun 8, 2020
@auvipy
Member

auvipy commented Jun 8, 2020

fixes are welcome

@noamkush
Contributor

noamkush commented Jun 9, 2020

So I figured out this is not an actual bug but SQS behavior. 30 minutes is Celery's default visibility timeout, and the message was simply redelivered to Celery. Which brings me to the following:

  1. The documentation is wrong: it states the default visibility timeout is 30 seconds, which is in fact AWS's default but not Celery's (I'll maybe open a PR later).
  2. Maybe this is preventable? Shouldn't the SQS transport periodically extend the visibility timeout? (A sketch of the current workaround follows below.)
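
In the meantime the workaround is to make the visibility timeout longer than the largest countdown/ETA in use; a minimal sketch (broker_transport_options is the documented setting, the value is only an example, app is your Celery instance):

# SQS transport: the visibility timeout must outlive the longest countdown/ETA,
# otherwise SQS redelivers the message while the worker is still waiting.
app.conf.broker_transport_options = {'visibility_timeout': 3600}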

@auvipy
Member

auvipy commented Jun 10, 2020

> So I figured out this is not an actual bug but SQS behavior. 30 minutes is Celery's default visibility timeout, and the message was simply redelivered to Celery. Which brings me to the following:
>
> 1. The documentation is wrong: it states the default visibility timeout is 30 seconds, which is in fact AWS's default but not Celery's (I'll maybe open a PR later).
>
> 2. Maybe this is preventable? Shouldn't the SQS transport periodically extend the visibility timeout?

both can be improved :) looking forward to your PR

@auvipy
Member

auvipy commented Feb 25, 2021

can you check this celery/kombu#1098 (comment)

@noamkush
Contributor

The task is received from SQS but isn't acked, since there's a countdown. Celery waits until the countdown is over but doesn't ack, and meanwhile the visibility timeout expires and the task is redelivered (this is actually noted in the caveats). What I suggested is that the Celery process waiting out the countdown should extend the visibility timeout so the message won't be redelivered.
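
For illustration, extending the visibility timeout at the SQS level looks roughly like this (a sketch only; the queue URL and receipt handle are hypothetical placeholders, not something Celery currently exposes for this purpose):

import boto3

sqs = boto3.client('sqs')

# Hypothetical values; in practice they would come from the ReceiveMessage
# call that delivered the task message.
queue_url = 'https://sqs.us-east-1.amazonaws.com/123456789012/celery'
receipt_handle = '<receipt handle of the delivered message>'

# Push the redelivery deadline out while the countdown is still pending.
sqs.change_message_visibility(
    QueueUrl=queue_url,
    ReceiptHandle=receipt_handle,
    VisibilityTimeout=15 * 60,  # seconds
)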

@rui-ren

rui-ren commented Sep 8, 2021

> @noamkush Interesting to see the timestamps; it looks like the task is received first and then maybe redelivered after 30 minutes, and then executed twice?

Do you have any insight into this case? How can it be solved? I have the same issue: I only have one job, it is accepted on one instance, and 1800 seconds later a duplicate is triggered on another instance. Because I am querying databases, this crashes all my jobs.

@auvipy
Member

auvipy commented Sep 8, 2021

https://github.com/celery/celery/pull/6651/files

@rui-ren

rui-ren commented Sep 8, 2021

> https://github.com/celery/celery/pull/6651/files

Thank you! I am using RabbitMQ and celery==4.2.0; all the answers above were for Redis. Do you have any good links for a RabbitMQ setup?

broker_url = '{}/{}'.format(
   rabbitmq_server,
   os.getenv('rabbitmq_vhost'),
)
backend = 'amqp'

@auvipy
Member

auvipy commented Sep 9, 2021

> https://github.com/celery/celery/pull/6651/files
>
> Thank you! I am using RabbitMQ and celery==4.2.0; all the answers above were for Redis. Do you have any good links for a RabbitMQ setup?
>
> broker_url = '{}/{}'.format(
>    rabbitmq_server,
>    os.getenv('rabbitmq_vhost'),
> )
> backend = 'amqp'

that is not a supported version.

@rui-ren

rui-ren commented Sep 10, 2021

> https://github.com/celery/celery/pull/6651/files
>
> Thank you! I am using RabbitMQ and celery==4.2.0; all the answers above were for Redis. Do you have any good links for a RabbitMQ setup?
>
> broker_url = '{}/{}'.format(
>    rabbitmq_server,
>    os.getenv('rabbitmq_vhost'),
> )
> backend = 'amqp'
>
> that is not a supported version.
I have only one task, but it got triggered again after 1800 seconds on another worker. I have no idea why this happens. Do you have any insight? Thanks.
I am using job = chain(single_job), but I only have one single_job, and job() starts the job.

mysql> show processlist;
+-------+------+---------------+-------------------+---------+------+-----------+
| Id    | User | Host          | db                | Command | Time | State     |
+-------+------+---------------+-------------------+---------+------+-----------+
| 97189 | clp  | 172.11.17.202 | database | Query   |    0 | init      |
| 97488 | clp  | 172.11.11.252 | database | Query   | 1505 | executing |
| 97489 | clp  | 172.11.11.252 | database | Sleep   | 1851 |           |
| 97543 | clp  | 172.21.6.242  | database | Query   |   51 | updating  |
| 97544 | clp  | 172.21.6.242  | database | Sleep   |   51 |           |
+-------+------+---------------+-------------------+---------+------+-----------+

Updated to celery==4.4.2 and cancelled the retry; bug solved!
