
CloudWatchLogHandler deadlocks when used with PYTHONBREAKPOINT=ipdb.set_trace #197

Open

dreid opened this issue Apr 4, 2024 · 2 comments

@dreid commented Apr 4, 2024

Versions:

  • watchtower==3.1.0
  • ipython==8.23.0
  • ipdb==0.13.13
  • python 3.11

Reproduction

The following script hangs when run with PYTHONBREAKPOINT=ipdb.set_trace:

from __future__ import annotations

from logging import INFO, LogRecord

from watchtower import CloudWatchLogHandler

log_record = LogRecord(
    name="",
    level=INFO,
    pathname="",
    lineno=0,
    msg="ok",
    args={},
    exc_info=None,
    func="",
    sinfo="",
)

handler = CloudWatchLogHandler()

print("sending messages")
handler.emit(log_record)

print("breaking")
breakpoint()

The running process has the following stack traces:

💤 Process 56995 🧵 Thread 0:6155792384

    Thread._bootstrap (/Users/dreid/.local/share/mise/installs/python/3.11.5/lib/python3.11/threading.py:995)
    Thread._bootstrap_inner (/Users/dreid/.local/share/mise/installs/python/3.11.5/lib/python3.11/threading.py:1038)
    Thread.run (/Users/dreid/.local/share/mise/installs/python/3.11.5/lib/python3.11/threading.py:975)
    CloudWatchLogHandler._dequeue_batch (/Users/dreid/src/pilot/zapgram/.ve/lib/python3.11/site-packages/watchtower/__init__.py:470)
    CloudWatchLogHandler._submit_batch (/Users/dreid/src/pilot/zapgram/.ve/lib/python3.11/site-packages/watchtower/__init__.py:344)
    ClientCreator._create_api_method.<locals>._api_call (/Users/dreid/src/pilot/zapgram/.ve/lib/python3.11/site-packages/botocore/client.py:553)
    BaseClient._make_api_call (/Users/dreid/src/pilot/zapgram/.ve/lib/python3.11/site-packages/botocore/client.py:946)
    BaseClient._emit_api_params (/Users/dreid/src/pilot/zapgram/.ve/lib/python3.11/site-packages/botocore/client.py:1072)
    EventAliaser.emit (/Users/dreid/src/pilot/zapgram/.ve/lib/python3.11/site-packages/botocore/hooks.py:413)
    HierarchicalEmitter.emit (/Users/dreid/src/pilot/zapgram/.ve/lib/python3.11/site-packages/botocore/hooks.py:257)
    HierarchicalEmitter._emit (/Users/dreid/src/pilot/zapgram/.ve/lib/python3.11/site-packages/botocore/hooks.py:238)
    Logger.debug (/Users/dreid/.local/share/mise/installs/python/3.11.5/lib/python3.11/logging/__init__.py:1476)
    Logger.isEnabledFor (/Users/dreid/.local/share/mise/installs/python/3.11.5/lib/python3.11/logging/__init__.py:1744)
    _acquireLock (/Users/dreid/.local/share/mise/installs/python/3.11.5/lib/python3.11/logging/__init__.py:235)


💤 Process 56995 🧵 Thread 0:7911054400

    <module> (/Users/dreid/src/pilot/zapgram/t.py:25)
    set_trace (/Users/dreid/src/pilot/zapgram/.ve/lib/python3.11/site-packages/ipdb/__main__.py:77)
    _init_pdb (/Users/dreid/src/pilot/zapgram/.ve/lib/python3.11/site-packages/ipdb/__main__.py:54)
    _get_debugger_cls (/Users/dreid/src/pilot/zapgram/.ve/lib/python3.11/site-packages/ipdb/__main__.py:34)
    catch_config_error.<locals>.inner (/Users/dreid/src/pilot/zapgram/.ve/lib/python3.11/site-packages/traitlets/config/application.py:118)
    TerminalIPythonApp.initialize (/Users/dreid/src/pilot/zapgram/.ve/lib/python3.11/site-packages/IPython/terminal/ipapp.py:269)
    catch_config_error.<locals>.inner (/Users/dreid/src/pilot/zapgram/.ve/lib/python3.11/site-packages/traitlets/config/application.py:118)
    BaseIPythonApplication.initialize (/Users/dreid/src/pilot/zapgram/.ve/lib/python3.11/site-packages/IPython/core/application.py:480)
    catch_config_error.<locals>.inner (/Users/dreid/src/pilot/zapgram/.ve/lib/python3.11/site-packages/traitlets/config/application.py:118)
    Application.parse_command_line (/Users/dreid/src/pilot/zapgram/.ve/lib/python3.11/site-packages/traitlets/config/application.py:880)
    Application._create_loader (/Users/dreid/src/pilot/zapgram/.ve/lib/python3.11/site-packages/traitlets/config/application.py:791)
    TraitType.__get__ (/Users/dreid/src/pilot/zapgram/.ve/lib/python3.11/site-packages/traitlets/traitlets.py:687)
    TraitType.get (/Users/dreid/src/pilot/zapgram/.ve/lib/python3.11/site-packages/traitlets/traitlets.py:653)
    HasTraits._notify_observers (/Users/dreid/src/pilot/zapgram/.ve/lib/python3.11/site-packages/traitlets/traitlets.py:1568)
    Application._observe_logging_default (/Users/dreid/src/pilot/zapgram/.ve/lib/python3.11/site-packages/traitlets/config/application.py:287)
    Application._configure_logging (/Users/dreid/src/pilot/zapgram/.ve/lib/python3.11/site-packages/traitlets/config/application.py:293)
    dictConfig (/Users/dreid/.local/share/mise/installs/python/3.11.5/lib/python3.11/logging/config.py:823)
    DictConfigurator.configure (/Users/dreid/.local/share/mise/installs/python/3.11.5/lib/python3.11/logging/config.py:546)
    _clearExistingHandlers (/Users/dreid/.local/share/mise/installs/python/3.11.5/lib/python3.11/logging/config.py:285)
    shutdown (/Users/dreid/.local/share/mise/installs/python/3.11.5/lib/python3.11/logging/__init__.py:2192)
    CloudWatchLogHandler.flush (/Users/dreid/src/pilot/zapgram/.ve/lib/python3.11/site-packages/watchtower/__init__.py:492)
    Queue.join (/Users/dreid/.local/share/mise/installs/python/3.11.5/lib/python3.11/queue.py:90)
    Condition.wait (/Users/dreid/.local/share/mise/installs/python/3.11.5/lib/python3.11/threading.py:320)

What is going on?

The top trace is the handler thread dequeuing messages, and the bottom is the main thread.

The main thread is waiting for the queue to be emptied, and the handler thread is waiting to acquire the logging module's global lock, which is already held by DictConfigurator.configure in the main thread.

The handler thread has to acquire the global lock in Logger.isEnabledFor because importing ipdb, as a side effect, calls Logger.setLevel(…), which invalidates the enabled-level cache of all Logger instances, and then calls logging.config.dictConfig, which tries to shut down all existing handlers.

This can also be reproduced if breakpoint() is replaced by:

print("setting level")
logger.setLevel(INFO)

print("changing logging config")
dictConfig({"version": 1})

This results in a pretty straightforward deadlock: the handler thread is not able to actually send the logs to CloudWatch, so it is never able to empty the queue.
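The same deadlock can be modeled with just the standard library (a sketch, not watchtower code: JoiningHandler stands in for CloudWatchLogHandler, whose flush joins its queue):

    import logging
    import logging.config
    import queue
    import threading
    import time

    q = queue.Queue()
    worker_logger = logging.getLogger("worker")

    def worker():
        q.get()
        time.sleep(0.5)  # let the main thread enter dictConfig first
        # setLevel below cleared the level cache, so this call must take
        # the logging module's global lock, which dictConfig is holding:
        worker_logger.isEnabledFor(logging.DEBUG)
        q.task_done()

    class JoiningHandler(logging.Handler):
        # Stand-in for CloudWatchLogHandler: flush waits for the queue to drain.
        def flush(self):
            q.join()

    logging.getLogger().addHandler(JoiningHandler())
    threading.Thread(target=worker, daemon=True).start()
    q.put("msg")

    logging.getLogger().setLevel(logging.INFO)  # invalidates every logger's level cache
    logging.config.dictConfig({"version": 1})   # holds the lock and flushes handlers: hangs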

Solutions?

The current behavior of flush is that it waits for the queue to be empty using queue.Queue.join.

queue.Queue.join doesn't take a timeout; however, it is implemented as:

    def join(self):
        with self.all_tasks_done:
            while self.unfinished_tasks:
                self.all_tasks_done.wait()

and queue.Queue.all_tasks_done is a threading.Condition, whose wait method does take a timeout. So it is possible to implement a join that times out as well.
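For example (a sketch; join_with_timeout is a hypothetical helper relying on the same Queue internals shown above, not necessarily what watchtower ended up shipping):

    import time
    from queue import Queue

    def join_with_timeout(q: Queue, timeout: float) -> bool:
        # Like Queue.join(), but gives up after `timeout` seconds.
        # Returns True if all tasks completed, False on timeout.
        deadline = time.monotonic() + timeout
        with q.all_tasks_done:
            while q.unfinished_tasks:
                remaining = deadline - time.monotonic()
                if remaining <= 0:
                    return False
                q.all_tasks_done.wait(remaining)
        return True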

flush could also be modified to send (self.FLUSH, flush_condition) to each queue and then call flush_condition.wait(timeout) to wait for that specific flush message to be processed, as sketched below.
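A rough sketch of that approach, using a threading.Event per queue rather than a bare Condition to avoid a lost-wakeup race (self.queues and self.FLUSH are stand-ins for watchtower's internals; the worker thread would have to call done.set() when it dequeues the marker):

    import threading
    import time

    def flush(self, timeout: float = 30.0) -> None:
        events = []
        for q in self.queues.values():
            done = threading.Event()
            # The worker sets this event once it has processed everything
            # enqueued ahead of the marker.
            q.put((self.FLUSH, done))
            events.append(done)
        deadline = time.monotonic() + timeout
        for done in events:
            done.wait(max(0.0, deadline - time.monotonic()))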

This changes the behavior of flush under load, though: instead of waiting until the queue is empty, it waits until a specific flush marker is processed. In a system where flush is called from multiple threads, each caller would wait only until its own flush message was processed and then proceed.

But this might be the more straightforward behavior: given a queue state of msg1, msg2, msg3, flush, msg4, msg5, it could be counterintuitive for flush to block until msg5 is delivered.

Once flush/close are able to time out at all, they could also be taught to check whether self.sequence_tokens[log_stream_name] has changed in the timeout case, allowing them to keep waiting as long as batches are still being submitted.
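Something along these lines (a sketch reusing the hypothetical join_with_timeout from above; flush_until_stalled and its signature are illustrative, not watchtower API):

    def flush_until_stalled(self, q, log_stream_name: str, timeout: float) -> None:
        # Keep waiting as long as the stream's sequence token keeps changing,
        # i.e. batches are still being submitted; give up once progress stalls.
        while True:
            token_before = self.sequence_tokens.get(log_stream_name)
            if join_with_timeout(q, timeout):
                return  # queue fully drained
            if self.sequence_tokens.get(log_stream_name) == token_before:
                return  # timed out with no progress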

@kislyuk (Owner) commented Apr 14, 2024

Thanks for your proposed solution! I am definitely interested in setting timeouts and implementing it as you describe.

@kislyuk (Owner) commented Jun 27, 2024

The queue wait timeout was implemented in ebf368f and released in v3.2.0.
