Ansible Hangs Filament
- 1 Purpose
- 2 What is a hang?
- 3 Layers
- 4 Tools By Layer
- 5 Labs
- 6 Special Topics
- 7 Resources
Use this page and it's set of exercises to familiarize yourself with how to narrow the scope of an ansible related "hang" and methods to introspect the possible causes.
What is a hang?
Any situation where ansible is "stuck" or not executing tasks. It's important to understand that "long running" and "stuck" aren't really the same thing. Long running tasks should normally be wrapped with `async` at the very least to inform end users that the remote pids are still working. If those pids go into infinite wait or polling states, then a "hang" might be happening.
ansible and ansible-playbook have a consistent workflow which might be thought of a layered model. Each of these layers is capable of causing a situation that appears to be a "hang" and each of these layers has methods and tools to determine if the hang has occurred there.
- `ansible-playbook` or `ansible`
- python: task executor
- python: action plugin
- python: connection plugin
- <-- network -->
- subprocess forks or python lib calls
Tools By Layer
The primary strategy with any "hang" is to eliminate any variables per Occam's razor. If each layer in the stack could have caused the hang, try to remove or rule out each of the layers from the equation. For example: if the remote python command is still running on the target, assume that "executor" is not at fault and skip to troubleshooting directly on the target instead of the controller.
Similar to how a person might troubleshoot electronics, you are measuring inputs and outputs of each component. If the component receives the correct input and send the right output, you can move on to the next component in the change.
If ansible is hanging on initial load of a playbook, check the ansible --version output to verify that ansible was installed correctly.
If ansible is stalled without any indication that tasks are executing on the remote machines, check the ps aux or top outputs for signs of high cpu or memory consumption.
Executor is a large swath of python code that involves queuing, forking and serialization. If you suspect that executor is hanging, there are a couple ways to introspect it:
- ansible[-playbook] -vvvv
- adding print statements or python-q at various points in the code.
Using -vvvv is always the first step in troubleshooting any ansible issue. If -vvvv shows that tasks are starting on each host, then executor is probably working correctly. Further introspection from ANSIBLE_DEBUG=1 will allow you to see what is going on in the queuing/iteration code.
-vvvv and ANSIBLE_DEBUG=1 and print|python-q injection are again the way to introspect action plugins.
-vvvv and ANSIBLE_DEBUG=1 and print|python-q injection are yet again the way to introspect connection plugins.
sshpass is a simplistic command that works similar to tcl/expect with a focus on checking for a prompt that ends with "assword:".
In the ssh connection plugin we configure sshpass to use a file descriptor for the password instead of a cli option.
if self._play_context.password: if not self._sshpass_available(): raise AnsibleError("to use the 'ssh' connection type with passwords, you must install the sshpass program") self.sshpass_pipe = os.pipe() b_command += [b'sshpass', b'-d' + to_bytes(self.sshpass_pipe, nonstring='simplerepr', errors='surrogate_or_strict')]
In certain environments the ssh clients are configured to have non-standard credential prompts, such as two-factor auth or something that doesn't end in 'assword:'. The behavior in those cases is that sshpass loops continuously looking for a set of characters that never shows up. If you suspect this to be the case, run the full ansible ssh command directly without sshpass to confirm. Use -vvvv to see that command.
Another complication is sshkeys with passphrases. sshpass can not handle these and will hang indefinitely: https://sourceforge.net/p/sshpass/patches/10/
https://github.com/ansible/ansible/issues/63871 In scale testing, we have noticed that sshpass can spin forever with a zombie ssh child process. There is no solution to this currently except to use a custom connection plugin to reap the pids: https://github.com/jctanner/ansible-scale-lab/blob/master/connection_plugins/ssh_killer.py and that should not be used in production.
ssh / sftp / scp client
SSH debugging always starts by running ansible with -vvvv. There is no reason to begin looking at anything else until you capture the -vvvv output. The reasons are:
- To get a copy/paste'able command from the ansible log which can be run directly without ansible to reproduce the problem
- To see the verbose ssh logs and verify communication is happening across the wire.
It is unlikely that you'll have to dig deeper with the ssh client, but strace is your friend in those cases.
Use tcpdump to verify that traffic is being sent and received from both sides of the wire. You won't be able to decrypt the traffic, but that is unimportant.
The ssh daemon is like any other process when it comes to debugging. It runs via an init system, takes in data and forks off other processes.
Things to check for with sshd:
- daemon restarts in syslog
- firewall rules
- selinux rules
- ip routing
- auth/secure log
Since this document focuses on "hangs", most of the above items should be inspected with the mindset that you are looking for changes that occurred very recently.
- lsof -i :22
Shell plugins are obscured from the end user typically unless they happen to use something like the fish shell. When a module executes, the -vvvv output will normally show the remote command as `/bin/sh -c "/usr/bin/python ..."` or something very similar. The `/bin/sh -c` part comes from the shell plugin. If you suspect a hang in this layer, strace is your best bet.
The become methods attempt to wrap remote execution with a privilege escalation tool (sudo most often). Always check the -vvvv output first when looking for become issues.
<el7host> SSH: EXEC ssh -vvv -C -o ControlMaster=auto -o ControlPersist=60s -o Port=2222 -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o User=vagrant -o ConnectTimeout=10 -o ControlPath=/root/.ansible/cp/b0641386d8 -tt el7host '/bin/sh -c '"'"'sudo -H -S -n -u root /bin/sh -c '"'"'"'"'"'"'"'"'echo BECOME-SUCCESS-povetpgwbtmdinygjzgmlzlmxiqpbeau; /usr/bin/python /home/vagrant/.ansible/tmp/ansible-tmp-1517284973.58-69017043183620/command.py'"'"'"'"'"'"'"'"' && sleep 0'"'"''
In ansible <= 2.4, you will have to look at the ANSIBLE_DEBUG output to verify the escalation succeeded. In 2.5+ the same data is now in the -vvvv output ...
You are looking for something such as "Escalation succeeded" in the output right after the previous ssh command example. If you see that message, privilege escalation is not hung.
If the message is not seen and you are using sudo without a password, try disabling -tt with the use_tty option in ansible.cfg....
python has a tracing library, but it's not usable once the process has been started without it. strace and ltrace are the tools of choice for basic tracing of an already running interpreter.
Another strategy involves trying to execute the module directly without ansible via the instructions in the doc link below. If reproducible, "python -m trace" or any tracing tools could be used to find where the code is hanging.
Modules that call system commands should always use run_command from module utils, which is a wrapper around subprocess.Popen. Depending on the module, the command could almost be anything. You'll want to check the ps -ef output and find the child pids for the module script.
Tracing forked commands typically starts with strace, unless you know more information about the command being run and where it logs to. In the strace output, trying checking what the process is polling on or trying to open. If the process has been running too long, you won't be able to see what the polled handle is actually for (file/network/etc). Take a look in /proc/<pid>/fd for further clues.
A strategy here is to check if the command will hang when called directly too. Eliminating ansible from the picture opens the door for contacting other support groups.
Modules import and call into other python libraries. A common place where this can hang is an infinite polling loop on some web resource that can't be resolved and has no timeout decorator.
walking the stack
# inventory.ini el7host ansible_ssh_host=x.x.x.x ansible_ssh_port=22 ansible_ssh_pass=vagrant # site.yml - hosts: el7host user: vagrant gather_facts: False tasks: - name: do something bad!!! shell: while true; do echo "polling"; sleep 5; done; become: True
Setting up the environment:
- create two linux virtual machines (centos 7 preferably)
- create a vagrant user on the second machine with a password of "vagrant" and full passwordless sudo privileges
- install ansible and sshpass on the first machine
- create an inventory.ini as show above
- create the site.yml as shown above
- run: ansible-playbook -vvvv -i inventory site.yml
Answer these questions:
- How many lines of output show up on the screen after the line that starts with "TASK [do something bad!!!]"
- What verbs show up after the word <el7host> on some of the lines?
- How was the module file sent to the remote host?
- Where was the module file put?
- What was done to the file before it was executed?
- What is the final ssh command used before the playbook hangs?
- What is the path for the control persist socket file?
- What is the last line output?
- What does the process tree look like on the controller?
- What does the process tree look like on the remote node?
- What system calls does strace show that the remote process is making?
- What happens when you stop the sshd service on the remote node? [restart the daemon and playbook afterwards]
- What happens when you add a firewall rule on the remote node to block ssh traffic from the controller? [clear the rule afterwards]
- What happens if you delete the control persist socket file? [restart the playbook afterwards]
- What is different in the logs when `pipelining = True` is set in ansible.cfg? [make sure you add it under the ssh_connection section]
On the remote host, set MaxStartups to 1:1:1 in /etc/ssh/sshd_config, then restart sshd. On the controller, create an inventory with 10 hosts and use ansible_ssh_host to have them all use the same machine. On the controller, create a playbook with a single task to sleep for 10 seconds.
- Do all tasks succeed?
- What failure messages do you see?
- What are the relevant entries in /var/log/messags and /var/log/secure on the remote node?
On a remote host, create the following sshrc file ...
[vagrant@el7host ~]$ cat /etc/ssh/sshrc echo "$(date) starting session" >> /tmp/ssh.log echo "$(date) sleeping for a while" >> /tmp/ssh.log sleep 10000 echo "$(date) sleep done" >> /tmp/ssh.log
Use the ping module to test connection to the machine.
- How long does the ping take to complete?
- What errors or messages appear in the -vvvv output?
- What does the process tree look like on the remote node?
- What syscalls are being made from the remote processes?
- Has python started to execute on the remote node?
- pending transactions
- forking in postscript
- repoquery stuck
- rpm database corruption hangs yum
- repository unreachable
- command never terminates
- RO var mount
- use async and move on
- yum locks