Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

File copy hangs until timeout is reached regardless of transfer time #736

Open
xibriz opened this issue Aug 16, 2023 · 4 comments
Open

File copy hangs until timeout is reached regardless of transfer time #736

xibriz opened this issue Aug 16, 2023 · 4 comments
Assignees

Comments

@xibriz
Copy link

xibriz commented Aug 16, 2023

SUMMARY

nxos_file_copy hangs until timeout is reached regadless of how long time the file copy operation takes.

ISSUE TYPE
  • Bug Report
COMPONENT NAME

nxos_file_copy

ANSIBLE VERSION
ansible [core 2.13.11]
  python version = 3.8.10 (default, May 26 2023, 14:05:08) [GCC 9.4.0]
  jinja version = 3.1.2
  libyaml = True
COLLECTION VERSION
Collection Version
---------- -------
cisco.nxos 5.1.0 
CONFIGURATION
DEFAULT_FILTER_PLUGIN_PATH(/home/ubuntu/automation/deploy-new-site/ansible/ansible.cfg) = ['/home/ubuntu/automation/deploy-new-site/ansible/filter_plugins']
DEFAULT_LOG_PATH(/home/ubuntu/automation/deploy-new-site/ansible/ansible.cfg) = /home/ubuntu/automation/deploy-new-site/ansible/debug_output.txt
PERSISTENT_COMMAND_TIMEOUT(/home/ubuntu/automation/deploy-new-site/ansible/ansible.cfg) = 600
PERSISTENT_CONNECT_TIMEOUT(/home/ubuntu/automation/deploy-new-site/ansible/ansible.cfg) = 600
OS / ENVIRONMENT
Software
  BIOS: version 01.03
  NXOS: version 10.1(1)
  BIOS compile time:  11/20/2020
  NXOS image file is: bootflash:///nxos.10.1.1.bin
  NXOS compile time:  2/14/2021 15:00:00 [02/15/2021 01:41:40]

Hardware
  cisco Nexus9000 C93180YC-FX3 Chassis 
  Intel(R) Xeon(R) CPU D-1526 @ 1.80GHz with 32823072 kB of memory.
  Processor Board ID FDO2550089E
  Device name: tos3.ksat.core-sw-1.c22
  bootflash:  115805708 kB
STEPS TO REPRODUCE

Initiate a file copy from the device using SCP.

- name: initiate file copy from device (will take 10 minutes)
  cisco.nxos.nxos_file_copy:
    file_pull: true
    local_file: nxos64-cs.10.2.5.M.bin
    local_file_directory: /
    remote_file: /tmp/nxos64-cs.10.2.5.M.bin
    remote_scp_server: 192.168.1.10
    remote_scp_server_user: admin
    remote_scp_server_password: admin
    vrf: management
    # This is deprecated but we have to set it because it overrides values in ansible.cfg untill removed in 2024-06-01
    file_pull_timeout: 600
  # We need to ignore erros because this can time out before the copy is complete
  ignore_errors: true

# Check if we can connect
- name: get current version
  cisco.nxos.nxos_command:
    commands:
      - command: show version
        output: text
    wait_for: result[0] contains Cisco
EXPECTED RESULTS

I expect the task to return a result when the file copy is done, but my observations show that the task will "hang" for the total amount of the timeout value (in this case 600sec/10min).

I have observed a timeout error after 10 minutes even tough the file copy is finished below 5 minutes.

When the timeout is reached, the result could be successfull/return the expected result, or it could be a timeout error. If the result is a timeout error following tasks will fail with a AnsibleConnectionFailure saying something like "cli prompt is not identified from the last received response window".

ACTUAL RESULTS

I observe the following results from the first task:

FAILED! => {
    "changed": false,
    "module_stderr": "command timeout triggered, timeout value is 600 secs.\nSee the timeout setting options in the Network Debug and Troubleshooting Guide.",
    "module_stdout": "",
    "msg": "MODULE FAILURE\nSee stdout/stderr for the exact error"
}

And then the next task will fail with:

  File "/home/ubuntu/.local/lib/python3.8/site-packages/ansible/utils/jsonrpc.py", line 46, in handle_request
    result = rpc_method(*args, **kwargs)
  File "/home/ubuntu/.ansible/collections/ansible_collections/cisco/nxos/plugins/cliconf/nxos.py", line 275, in run_commands
    out = self.send_command(**cmd)
  File "/home/ubuntu/.ansible/collections/ansible_collections/ansible/netcommon/plugins/plugin_utils/cliconf_base.py", line 142, in send_command
    resp = self._connection.send(**kwargs)
  File "/home/ubuntu/.ansible/collections/ansible_collections/ansible/netcommon/plugins/connection/network_cli.py", line 338, in wrapped
    self.update_cli_prompt_context()
  File "/home/ubuntu/.ansible/collections/ansible_collections/ansible/netcommon/plugins/connection/network_cli.py", line 586, in update_cli_prompt_context
    self.set_cli_prompt_context()
  File "/home/ubuntu/.ansible/collections/ansible_collections/cisco/nxos/plugins/cliconf/nxos.py", line 392, in set_cli_prompt_context
    raise AnsibleConnectionFailure(
ansible.errors.AnsibleConnectionFailure: cli prompt is not identified from the last received response window: b'4-cs.10.2.5.M.bin                                                                                                                                                                                                                                                                                                                                                                                                                                                                        84% 1570MB   7.7MB/s   00:36 ETA\rnxos64-cs.10.2.5.M.bin                                                                                                                                                                                                                                                                                                                                                                                                                                                                        85% 1578MB   7.7MB/s   00:35 ETA\rnxos64-cs.10.2.5.M.bin                                                                                                                                                                                                                                                                                                                                                                                                                                                                        85% 1585MB   7.7MB/s   00:34 ETA\rnxos64-cs.10.2.5.M.bin                                                                                                                                                                                                                                                                                                                                                                                                                                                                        85% 1593MB   7.8MB/s   00:33 ETA\rnxos64-cs.10.2.5.M.bin                                                                                                                                                                                                                                                                                                                                                                                                                                                                        86% 1599MB   7.6MB/s   00:33 ETA\rnxos64-cs.10.2.5.M.bin                                                                                                                                                                                                                                                                                                                                                                                                                                                                        86% 1607MB   7.6MB/s   00:32 ETA\rnxos64-cs.10.2.5.M.bin                                                                                                                                                                                                                                                                                                                                                                                                                                                                        87% 1615MB   7.6MB/s   00:31 ETA\rnxos64-cs.10.2.5.M.bin                                                                                                                                                                                                                                                                                                                                                                                                                                                                        87% 1623MB   7.7MB/s   00:30 ETA\rnxos64-cs.10.2.5.M.bin                                                                                                                                                                                                                                                                                                                                                                                                                                                                        87% 1631MB   7.7MB/s   00:28 ETA'

This will happen even tough I manually comfirm the transfer beeing complete several minutes before the error occurs.

Some times I can observe the last output saying:

...
Copy complete, now saving to disk (please wait)...
Copy complete.

When I manually do the transfer I can see the following happen:

# copy scp://[email protected]//tmp/nxos64-cs.10.2.5.M.bin bootflash:nxos64-cs.10.2.5.M.bin
Enter vrf (If no input, current vrf 'default' is considered): management
Outbound-ReKey for 192.168.1.10:22
Inbound-ReKey for 192.168.1.10:22
Password: 
nxos64-cs.10.2.5.M.bin                                                                                                                                                      54% 1015MB   7.5MB/s   01:51 ETAOutbound-ReKey for 192.168.1.10:22
Inbound-ReKey for 192.168.1.10:22
nxos64-cs.10.2.5.M.bin                                                                                                                                                     100% 1853MB   7.6MB/s   04:04    
Copy complete, now saving to disk (please wait)...
Copy complete.
@NilashishC NilashishC self-assigned this Aug 24, 2023
@jorgenspange
Copy link

Hello!

Not totally related, but it seems to fail towards Nexus 7k with http:

fatal: [N7K]: FAILED! => changed=false
  copy_cmd: copy http://********@192.168.1.10/n7000-s2-dk9.8.4.6a.bin bootflash:/n7000-s2-dk9.8.4.6a.bin vrf management
  file_system: 'bootflash:'
  local_file: bootflash:/n7000-s2-dk9.8.4.6a.bin
  remote_file: n7000-s2-dk9.8.4.6a.bin
  remote_scp_server: 192.168.1.10

Here's my play:

    - name: cisco.nxos.nxos_upload | Upload compliant software image.
      cisco.nxos.nxos_file_copy:
        file_pull: true
        file_pull_protocol: http
        file_pull_timeout: 7200
        remote_file: "{{ software_image }}"
        remote_scp_server: "{{ software_file_server }}"
        remote_scp_server_user: "undefined"
        remote_scp_server_password: "undefined"
        vrf: "{{ software_vrf }}"

Also would be nice that "remote_scp_server_user" and "remote_scp_server_password" not is required when using other protocols than scp. "Remote_scp_server" could also be called something like "remote_file_server" as other protcols than scp is used.

Thank you so much!

Br
Jørgen

@NilashishC
Copy link
Collaborator

@xibriz Could you please share the device interaction logs for the play that's timing out? I suspect it has to do with some minor prompt mismatch from your appliance.

Steps to enable it is described in here: https://docs.ansible.com/ansible/latest/network/user_guide/network_debug_troubleshooting.html#enabling-networking-device-interaction-logging

@NilashishC
Copy link
Collaborator

@jorgenspange I will be really helpful if you could open a separate issue for the ones your mentioned. Thank you!

@NilashishC
Copy link
Collaborator

@xibriz With help from @praveenramoorthy, I was able to reproduce the file_copy issue on our end as well. The other problem mentioned here about subsequent tasks failing even when the file copy task has ignore_errors: True is just a side effect of the original issue.

Similar to the logs that you've shared, we found that for every iteration, ansible "hangs" once the copy completes to a. We believe that this has something to do with an internal buffer in pylibssh filling up at this stage, resulting in network_cli not receiving any more data to process. As a result, it keeps waiting till command_timeout is triggered and then moves on to the next task.

While "resetting" the device's prompt state in preparation for the next task, the set_cli_prompt_context() method errors out with cli prompt unidentified... error, because the last received response from the device (via pylibssh) is 28% 438MB 5.2MB/s 03:21 ETA \rnxos6 which obviously does not contain a prompt, so it doesn't know where to begin. This happens due to the fact that the ssh connection persists between tasks in a single play, unless you force reset the connection.

When we switched to paramiko, using ansible_network_cli_ssh_type: paramiko, this seemed to work just fine and it correctly moved on to the next task as soon as the copy completed, regardless of how high the timeout was set.

I'll shortly be opening a ticket in the ansible/pylibssh GitHub repo to track this and keep this thread updated.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants