
Guest shuts-off with call trace after migrating between Boston and ZZ systems #19

Closed
balamuruhans opened this issue Oct 12, 2017 · 6 comments

Comments

@balamuruhans

balamuruhans commented Oct 12, 2017

Mirrored with LTC bug https://bugzilla.linux.ibm.com/show_bug.cgi?id=160010

Description:
The guest goes to the shut-off state after migrating between Boston and ZZ systems. The migration itself succeeds without any issues, but after migration the guest on the destination hits call traces and shuts off.

Call Trace:

[root@localhost ~]# [   89.329008] Unable to handle kernel paging request for instruction fetch
[   89.330591] Faulting instruction address: 0xc0000000000b19dc
[   89.331872] Oops: Kernel access of bad area, sig: 11 [#1]
[   89.333083] SMP NR_CPUS=1024 
[   89.333086] NUMA 
[   89.333762] pSeries
[   89.334699] Modules linked in: ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 xt_conntrack ip_set nfnetlink ebtable_nat ebtable_broute bridge stp llc ip6table_mangle ip6table_security ip6table_raw iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack libcrc32c iptable_mangle iptable_security iptable_raw ebtable_filter ebtables ip6table_filter ip6_tables virtio_balloon virtio_blk virtio_net virtio_scsi
[   89.344729] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.13.0-4.rel.git49564cb.el7.centos.ppc64le #1
[   89.347429] task: c000000001304780 task.stack: c000000001390000
[   89.349149] NIP: c0000000000b19dc LR: c0000000000bfdf4 CTR: c00000000fd80000
[   89.351232] REGS: c00000000fffb970 TRAP: 0400   Not tainted  (4.13.0-4.rel.git49564cb.el7.centos.ppc64le)
[   89.353951] MSR: 8000000040009033 <SF,EE,ME,IR,DR,RI,LE>
[   89.353989]   CR: 48002042  XER: 00000000
[   89.356620] CFAR: c00000000037f7f8 SOFTE: 0 
[   89.356620] GPR00: 0000000028002042 c00000000fffbbf0 c000000001397a00 0000000000000000 
[   89.356620] GPR04: 0000000000000000 0000000000000000 0000000000000000 000000000000004e 
[   89.356620] GPR08: 0000000000000000 0000000000000000 0000000000000000 0000000000000059 
[   89.356620] GPR12: c00000000065ffd0 c00000000fd80000 000000000dc5bd20 0000000000000060 
[   89.356620] GPR16: 0000000002cd41d8 fffffffffffffffd 000000000dc5bd20 000000000e453a80 
[   89.356620] GPR20: c000000001299f98 c00000000fffbd00 0000000000000001 0000000000000000 
[   89.356620] GPR24: 0000000000000000 c00000000169087c 0000000000000000 c00000000fffbd00 
[   89.356620] GPR28: 0000000000000010 c000000001690878 c000000001690738 c00000000169087c 
[   89.376064] NIP [c0000000000b19dc] plpar_hcall+0x38/0x58
[   89.377556] LR [c0000000000bfdf4] hvc_get_chars+0x34/0x90
[   89.379120] Call Trace:
[   89.379843] [c00000000fffbbf0] [c0000001f1313c00] 0xc0000001f1313c00 (unreliable)
[   89.382061] [c00000000fffbc80] [c000000000660108] hvterm_raw_get_chars+0x138/0x1e0
[   89.384245] [c00000000fffbce0] [c000000000662990] hvc_poll+0x120/0x380
[   89.386076] [c00000000fffbd80] [c000000000663d64] hvc_handle_interrupt+0x24/0x50
[   89.388154] [c00000000fffbda0] [c000000000171680] __handle_irq_event_percpu+0x90/0x2d0
[   89.390466] [c00000000fffbe60] [c0000000001718f8] handle_irq_event_percpu+0x38/0x90
[   89.392735] [c00000000fffbea0] [c0000000001719b8] handle_irq_event+0x68/0xd0
[   89.394748] [c00000000fffbed0] [c000000000176e74] handle_fasteoi_irq+0xc4/0x1f0
[   89.396811] [c00000000fffbf00] [c00000000016fd9c] generic_handle_irq+0x4c/0x80
[   89.398865] [c00000000fffbf20] [c000000000016624] __do_irq+0x94/0x200
[   89.400717] [c00000000fffbf90] [c00000000002afe4] call_do_irq+0x14/0x24
[   89.402592] [c0000000013939e0] [c00000000001682c] do_IRQ+0x9c/0x110
[   89.404380] [c000000001393a30] [c000000000008c58] hardware_interrupt_common+0x158/0x160
[   89.406641] --- interrupt: 501 at plpar_hcall_norets+0x1c/0x28
[   89.406641]     LR = check_and_cede_processor+0x34/0x50
[   89.409814] [c000000001393d20] [c0000000008caa60] check_and_cede_processor+0x20/0x50 (unreliable)
[   89.412348] [c000000001393d80] [c0000000008cae30] shared_cede_loop+0x50/0x160
[   89.414346] [c000000001393db0] [c0000000008c7f64] cpuidle_enter_state+0xc4/0x3d0
[   89.416427] [c000000001393e10] [c000000000157ccc] call_cpuidle+0x4c/0x80
[   89.418304] [c000000001393e30] [c000000000158170] do_idle+0x2b0/0x350
[   89.420133] [c000000001393ea0] [c000000000158418] cpu_startup_entry+0x38/0x40
[   89.422177] [c000000001393ed0] [c00000000000d924] rest_init+0xf4/0x110
[   89.424049] [c000000001393f00] [c000000000e14254] start_kernel+0x530/0x54c
[   89.426003] [c000000001393f90] [c00000000000b27c] start_here_common+0x1c/0x520
[   89.428121] Instruction dump:
[   89.429004] 7c421378 7c000026 90010008 60000000 f8810028 7ca42b78 7cc53378 7ce63b78 
[   89.431251] 7d074378 7d284b78 7d495378 44000022 <e9810028> f88c0000 f8ac0008 f8cc0010 
[   89.433525] ---[ end trace 89de1301a015647c ]---

Test Environment:

Libvirt - 3.6.0-3.rel.gitdd9401b.el7.centos.ppc64le
Qemu - 2.10.0-2.rel.gitc334a4e.el7.centos.ppc64le
SLOF - SLOF-20170724-2.rel.gitea31295.el7.centos.noarch
Guest Kernel - 4.13.0-4.rel.git49564cb.el7.centos.ppc64le
Host Kernel - 4.13.0-4.rel.git49564cb.el7.centos.ppc64le
@cdeadmin

------- Comment From [email protected] 2017-10-23 04:31:48 EDT-------
> [full call trace from the description above, quoted verbatim]

The guest is shutting down because the tty driver crashes with an ISI exception.
It's crashing in the hvc_get_chars() routine while executing the H_GET_TERM_CHAR hcall.
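
For reference, the crashing routine is the pSeries hypervisor-console read path. A rough sketch of it (following arch/powerpc/platforms/pseries/hvconsole.c; exact code differs between kernel versions) shows where the faulting plpar_hcall() sits:

    /* Sketch of hvc_get_chars(); assumes the usual pseries headers
     * (asm/hvconsole.h, asm/hvcall.h). */
    int hvc_get_chars(uint32_t vtermno, char *buf, int count)
    {
            long ret;
            unsigned long retbuf[PLPAR_HCALL_BUFSIZE];
            unsigned long *lbuf = (unsigned long *)buf;

            /* plpar_hcall() is the faulting NIP in the oops above;
             * H_GET_TERM_CHAR asks the hypervisor for up to 16 bytes
             * of console input for this vterm. */
            ret = plpar_hcall(H_GET_TERM_CHAR, retbuf, vtermno);

            lbuf[0] = be64_to_cpu(retbuf[1]);
            lbuf[1] = be64_to_cpu(retbuf[2]);

            if (ret == H_SUCCESS)
                    return retbuf[0];   /* number of chars returned */

            return 0;
    }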

@cdeadmin

------- Comment From [email protected] 2017-10-23 05:05:11 EDT-------
(In reply to comment #2)

> The guest is shutting down because the tty driver crashes with an ISI exception.
> It's crashing in the hvc_get_chars() routine while executing the H_GET_TERM_CHAR
> hcall.

Hi Leonardo,

The guest is crashing with an ISI exception in the tty driver soon after migration. It doesn't
boot subsequently either, and crashes each time. This seems to be a qemu issue.
Can you please have someone look into this?

@cdeadmin

cdeadmin commented Dec 6, 2017

------- Comment From [email protected] 2017-12-06 15:10:36 EDT-------
I wasn't able to reproduce this bug. The test environment was:

  • ZZ machine: kernel 4.14.0-3.dev.git68b4afb.el7.centos.ppc64le

  • Boston machine: kernel 4.14.0-1.rel.git68b4afb.el7.centos.ppc64le

  • upstream QEMU (2.11.0-rc4) on both hosts.

I tried migrating with 2 different guests: Ubuntu 17.04 with kernel 4.10.0-28-ppc64le and Fedora 26 with kernel 4.13.16-202.fc26.ppc64le. With both guests, migration in both directions (ZZ -> Boston and Boston -> ZZ) worked without problems.

I see that the bug was opened against QEMU 2.10.0-2. It is likely that whatever caused the bug to happen is already fixed upstream.

aik pushed a commit that referenced this issue Jan 31, 2018
Direct leak of 160 byte(s) in 4 object(s) allocated from:
    #0 0x55ed7678cda8 in calloc (/home/elmarco/src/qq/build/x86_64-softmmu/qemu-system-x86_64+0x797da8)
    #1 0x7f3f5e725f75 in g_malloc0 /home/elmarco/src/gnome/glib/builddir/../glib/gmem.c:124
    #2 0x55ed778aa3a7 in query_option_descs /home/elmarco/src/qq/util/qemu-config.c:60:16
    #3 0x55ed778aa307 in get_drive_infolist /home/elmarco/src/qq/util/qemu-config.c:140:19
    #4 0x55ed778a9f40 in qmp_query_command_line_options /home/elmarco/src/qq/util/qemu-config.c:254:36
    #5 0x55ed76d4868c in qmp_marshal_query_command_line_options /home/elmarco/src/qq/build/qmp-marshal.c:3078:14
    #6 0x55ed77855dd5 in do_qmp_dispatch /home/elmarco/src/qq/qapi/qmp-dispatch.c:104:5
    #7 0x55ed778558cc in qmp_dispatch /home/elmarco/src/qq/qapi/qmp-dispatch.c:131:11
    #8 0x55ed768b592f in handle_qmp_command /home/elmarco/src/qq/monitor.c:3840:11
    #9 0x55ed7786ccfe in json_message_process_token /home/elmarco/src/qq/qobject/json-streamer.c:105:5
    #10 0x55ed778fe37c in json_lexer_feed_char /home/elmarco/src/qq/qobject/json-lexer.c:323:13
    #11 0x55ed778fdde6 in json_lexer_feed /home/elmarco/src/qq/qobject/json-lexer.c:373:15
    #12 0x55ed7786cd83 in json_message_parser_feed /home/elmarco/src/qq/qobject/json-streamer.c:124:12
    #13 0x55ed768b559e in monitor_qmp_read /home/elmarco/src/qq/monitor.c:3882:5
    #14 0x55ed77714f29 in qemu_chr_be_write_impl /home/elmarco/src/qq/chardev/char.c:167:9
    #15 0x55ed77714fde in qemu_chr_be_write /home/elmarco/src/qq/chardev/char.c:179:9
    #16 0x55ed7772ffad in tcp_chr_read /home/elmarco/src/qq/chardev/char-socket.c:440:13
    #17 0x55ed7777113b in qio_channel_fd_source_dispatch /home/elmarco/src/qq/io/channel-watch.c:84:12
    #18 0x7f3f5e71d90b in g_main_dispatch /home/elmarco/src/gnome/glib/builddir/../glib/gmain.c:3182
    #19 0x7f3f5e71e7ac in g_main_context_dispatch /home/elmarco/src/gnome/glib/builddir/../glib/gmain.c:3847
    #20 0x55ed77886ffc in glib_pollfds_poll /home/elmarco/src/qq/util/main-loop.c:214:9
    #21 0x55ed778865fd in os_host_main_loop_wait /home/elmarco/src/qq/util/main-loop.c:261:5
    #22 0x55ed77886222 in main_loop_wait /home/elmarco/src/qq/util/main-loop.c:515:11
    #23 0x55ed76d2a4df in main_loop /home/elmarco/src/qq/vl.c:1995:9
    #24 0x55ed76d1cb4a in main /home/elmarco/src/qq/vl.c:4914:5
    #25 0x7f3f555f6039 in __libc_start_main (/lib64/libc.so.6+0x21039)

Signed-off-by: Marc-André Lureau <[email protected]>
Reviewed-by: Eric Blake <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
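
As an aside for readers of this trace: the "Direct leak" is the g_malloc0() block itself, and the numbered frames are its allocation stack; the fix is to free what query_option_descs() built once it has been consumed. A minimal self-contained illustration of this leak class (hypothetical names, not the actual QEMU patch):

    #include <glib.h>

    /* Hypothetical illustration of the leak class flagged above: a
     * g_malloc0() allocation handed to a caller that never frees it. */
    static char **build_names(const char **src, gsize n)
    {
        char **names = g_malloc0((n + 1) * sizeof(*names));
        for (gsize i = 0; i < n; i++) {
            names[i] = g_strdup(src[i]);
        }
        return names;               /* caller owns this array */
    }

    int main(void)
    {
        const char *src[] = { "bus", "unit", "index" };
        char **names = build_names(src, G_N_ELEMENTS(src));

        /* Without this call, LeakSanitizer reports a direct leak from
         * g_malloc0() plus indirect leaks from g_strdup(). */
        g_strfreev(names);
        return 0;
    }
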
aik pushed a commit that referenced this issue Jan 31, 2018
Spotted thanks to ASAN:

==25226==ERROR: AddressSanitizer: global-buffer-overflow on address 0x556715a1f120 at pc 0x556714b6f6b1 bp 0x7ffcdfac1360 sp 0x7ffcdfac1350
READ of size 1 at 0x556715a1f120 thread T0
    #0 0x556714b6f6b0 in init_disasm /home/elmarco/src/qemu/disas/s390.c:219
    #1 0x556714b6fa6a in print_insn_s390 /home/elmarco/src/qemu/disas/s390.c:294
    #2 0x55671484d031 in monitor_disas /home/elmarco/src/qemu/disas.c:635
    #3 0x556714862ec0 in memory_dump /home/elmarco/src/qemu/monitor.c:1324
    #4 0x55671486342a in hmp_memory_dump /home/elmarco/src/qemu/monitor.c:1418
    #5 0x5567148670be in handle_hmp_command /home/elmarco/src/qemu/monitor.c:3109
    #6 0x5567148674ed in qmp_human_monitor_command /home/elmarco/src/qemu/monitor.c:613
    #7 0x556714b00918 in qmp_marshal_human_monitor_command /home/elmarco/src/qemu/build/qmp-marshal.c:1704
    #8 0x556715138a3e in do_qmp_dispatch /home/elmarco/src/qemu/qapi/qmp-dispatch.c:104
    #9 0x556715138f83 in qmp_dispatch /home/elmarco/src/qemu/qapi/qmp-dispatch.c:131
    #10 0x55671485cf88 in handle_qmp_command /home/elmarco/src/qemu/monitor.c:3839
    #11 0x55671514e80b in json_message_process_token /home/elmarco/src/qemu/qobject/json-streamer.c:105
    #12 0x5567151bf2dc in json_lexer_feed_char /home/elmarco/src/qemu/qobject/json-lexer.c:323
    #13 0x5567151bf827 in json_lexer_feed /home/elmarco/src/qemu/qobject/json-lexer.c:373
    #14 0x55671514ee62 in json_message_parser_feed /home/elmarco/src/qemu/qobject/json-streamer.c:124
    #15 0x556714854b1f in monitor_qmp_read /home/elmarco/src/qemu/monitor.c:3881
    #16 0x556715045440 in qemu_chr_be_write_impl /home/elmarco/src/qemu/chardev/char.c:172
    #17 0x556715047184 in qemu_chr_be_write /home/elmarco/src/qemu/chardev/char.c:184
    #18 0x55671505a8e6 in tcp_chr_read /home/elmarco/src/qemu/chardev/char-socket.c:440
    #19 0x5567150943c3 in qio_channel_fd_source_dispatch /home/elmarco/src/qemu/io/channel-watch.c:84
    #20 0x7fb90292b90b in g_main_dispatch ../glib/gmain.c:3182
    #21 0x7fb90292c7ac in g_main_context_dispatch ../glib/gmain.c:3847
    #22 0x556715162eca in glib_pollfds_poll /home/elmarco/src/qemu/util/main-loop.c:214
    #23 0x556715163001 in os_host_main_loop_wait /home/elmarco/src/qemu/util/main-loop.c:261
    #24 0x5567151631fa in main_loop_wait /home/elmarco/src/qemu/util/main-loop.c:515
    #25 0x556714ad6d3b in main_loop /home/elmarco/src/qemu/vl.c:1950
    #26 0x556714ade329 in main /home/elmarco/src/qemu/vl.c:4865
    #27 0x7fb8fe5c9009 in __libc_start_main (/lib64/libc.so.6+0x21009)
    #28 0x5567147af4d9 in _start (/home/elmarco/src/qemu/build/s390x-softmmu/qemu-system-s390x+0xf674d9)

0x556715a1f120 is located 32 bytes to the left of global variable 'char_hci_type_info' defined in '/home/elmarco/src/qemu/hw/bt/hci-csr.c:493:23' (0x556715a1f140) of size 104
0x556715a1f120 is located 8 bytes to the right of global variable 's390_opcodes' defined in '/home/elmarco/src/qemu/disas/s390.c:860:33' (0x556715a15280) of size 40600

This fix is based on Andreas Arnez's <[email protected]> upstream
commit:
https://sourceware.org/git/gitweb.cgi?p=binutils-gdb.git;a=commitdiff;h=9ace48f3d7d80ce09c5df60cccb433470410b11b

2014-08-19  Andreas Arnez  <[email protected]>

       * s390-dis.c (init_disasm): Simplify initialization of
       opc_index[].  This also fixes an access after the last element
       of s390_opcodes[].

Signed-off-by: Marc-André Lureau <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
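
Per the changelog above, the bug class is an index table built from an opcode table where the forward scan reads one element past the end of s390_opcodes[]; the fix builds the index in a single reverse pass. A simplified sketch of that pattern (names shortened, not the literal binutils code):

    #include <string.h>

    #define NUM_OPCODES 3

    struct opcode { unsigned char first_byte; /* ... */ };

    static const struct opcode opcodes[NUM_OPCODES] = {
        { 0x07 }, { 0x07 }, { 0x0b },
    };
    static int opc_index[256];

    /* A forward scan that groups runs by peeking at opcodes[i + 1]
     * reads one element past the end of the table on its last
     * iteration -- the global-buffer-overflow ASAN reports above.
     * Filling the index backwards needs no lookahead, and each slot
     * ends up pointing at the FIRST table entry with that leading
     * byte. */
    static void init_index(void)
    {
        int i;

        memset(opc_index, 0, sizeof(opc_index));
        for (i = NUM_OPCODES; i--; )
            opc_index[opcodes[i].first_byte] = i;
    }
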
@cdeadmin

------- Comment From [email protected] 2018-02-14 11:01:16 EDT-------
I've re-read this bug and this looks remarkably similar to a kernel bug that was already sorted out here:

https://bugzilla.linux.ibm.com/show_bug.cgi?id=163870

This is the patch that fixes that bug:

https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git/commit/?id=62e984ddfd6b056d399e24113f5e6a

It's not upstream yet, but it is already in the 'fixes' branch of the ppc kernel maintainer's tree. As soon as it is upstream, we can get a patched Host OS kernel to verify whether it fixes the bug we're seeing here too.

aik pushed a commit that referenced this issue Mar 29, 2018
Spotted thanks to ASAN:
QTEST_QEMU_BINARY=x86_64-softmmu/qemu-system-x86_64 tests/migration-test -p /x86_64/migration/bad_dest

==30302==ERROR: LeakSanitizer: detected memory leaks

Direct leak of 48 byte(s) in 1 object(s) allocated from:
    #0 0x7f60efba1a38 in __interceptor_calloc (/lib64/libasan.so.4+0xdea38)
    #1 0x7f60eef3cf75 in g_malloc0 ../glib/gmem.c:124
    #2 0x55ca9094702c in error_copy /home/elmarco/src/qemu/util/error.c:203
    #3 0x55ca9037a30f in migrate_set_error /home/elmarco/src/qemu/migration/migration.c:1139
    #4 0x55ca9037a462 in migrate_fd_error /home/elmarco/src/qemu/migration/migration.c:1150
    #5 0x55ca9038162b in migrate_fd_connect /home/elmarco/src/qemu/migration/migration.c:2411
    #6 0x55ca90386e41 in migration_channel_connect /home/elmarco/src/qemu/migration/channel.c:81
    #7 0x55ca9038335e in socket_outgoing_migration /home/elmarco/src/qemu/migration/socket.c:85
    #8 0x55ca9083dd3a in qio_task_complete /home/elmarco/src/qemu/io/task.c:142
    #9 0x55ca9083d6cc in gio_task_thread_result /home/elmarco/src/qemu/io/task.c:88
    #10 0x7f60eef37317 in g_idle_dispatch ../glib/gmain.c:5552
    #11 0x7f60eef3490b in g_main_dispatch ../glib/gmain.c:3182
    #12 0x7f60eef357ac in g_main_context_dispatch ../glib/gmain.c:3847
    #13 0x55ca90927231 in glib_pollfds_poll /home/elmarco/src/qemu/util/main-loop.c:214
    #14 0x55ca90927420 in os_host_main_loop_wait /home/elmarco/src/qemu/util/main-loop.c:261
    #15 0x55ca909275fa in main_loop_wait /home/elmarco/src/qemu/util/main-loop.c:515
    #16 0x55ca8fc1c2a4 in main_loop /home/elmarco/src/qemu/vl.c:1942
    #17 0x55ca8fc2eb3a in main /home/elmarco/src/qemu/vl.c:4724
    #18 0x7f60e4082009 in __libc_start_main (/lib64/libc.so.6+0x21009)

Indirect leak of 45 byte(s) in 1 object(s) allocated from:
    #0 0x7f60efba1850 in malloc (/lib64/libasan.so.4+0xde850)
    #1 0x7f60eef3cf0c in g_malloc ../glib/gmem.c:94
    #2 0x7f60eef3d1cf in g_malloc_n ../glib/gmem.c:331
    #3 0x7f60eef596eb in g_strdup ../glib/gstrfuncs.c:363
    #4 0x55ca90947085 in error_copy /home/elmarco/src/qemu/util/error.c:204
    #5 0x55ca9037a30f in migrate_set_error /home/elmarco/src/qemu/migration/migration.c:1139
    #6 0x55ca9037a462 in migrate_fd_error /home/elmarco/src/qemu/migration/migration.c:1150
    #7 0x55ca9038162b in migrate_fd_connect /home/elmarco/src/qemu/migration/migration.c:2411
    #8 0x55ca90386e41 in migration_channel_connect /home/elmarco/src/qemu/migration/channel.c:81
    #9 0x55ca9038335e in socket_outgoing_migration /home/elmarco/src/qemu/migration/socket.c:85
    #10 0x55ca9083dd3a in qio_task_complete /home/elmarco/src/qemu/io/task.c:142
    #11 0x55ca9083d6cc in gio_task_thread_result /home/elmarco/src/qemu/io/task.c:88
    #12 0x7f60eef37317 in g_idle_dispatch ../glib/gmain.c:5552
    #13 0x7f60eef3490b in g_main_dispatch ../glib/gmain.c:3182
    #14 0x7f60eef357ac in g_main_context_dispatch ../glib/gmain.c:3847
    #15 0x55ca90927231 in glib_pollfds_poll /home/elmarco/src/qemu/util/main-loop.c:214
    #16 0x55ca90927420 in os_host_main_loop_wait /home/elmarco/src/qemu/util/main-loop.c:261
    #17 0x55ca909275fa in main_loop_wait /home/elmarco/src/qemu/util/main-loop.c:515
    #18 0x55ca8fc1c2a4 in main_loop /home/elmarco/src/qemu/vl.c:1942
    #19 0x55ca8fc2eb3a in main /home/elmarco/src/qemu/vl.c:4724
    #20 0x7f60e4082009 in __libc_start_main (/lib64/libc.so.6+0x21009)

Signed-off-by: Marc-André Lureau <[email protected]>
Message-Id: <[email protected]>
Reviewed-by: Dr. David Alan Gilbert <[email protected]>
Reviewed-by: Philippe Mathieu-Daudé <[email protected]>
Signed-off-by: Dr. David Alan Gilbert <[email protected]>
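
The ownership rule behind both stacks: error_copy() allocates a new Error (the direct leak) and duplicates its message string (the indirect leak), so every stored copy needs a matching error_free(). A minimal stand-in sketch of that rule (simplified types, not QEMU's actual qapi/error.h):

    #include <glib.h>

    /* Simplified stand-in for QEMU's Error object, to show the
     * ownership rule behind the leaks above. */
    typedef struct Error {
        char *msg;
    } Error;

    static Error *error_copy(const Error *err)
    {
        Error *copy = g_malloc0(sizeof(*copy));  /* the "direct" leak site */
        copy->msg = g_strdup(err->msg);          /* the "indirect" leak site */
        return copy;
    }

    static void error_free(Error *err)
    {
        if (err) {
            g_free(err->msg);
            g_free(err);
        }
    }

    int main(void)
    {
        Error local = { .msg = (char *)"connect failed" };

        /* e.g. migrate_set_error() storing a copy of the caller's error */
        Error *stored = error_copy(&local);

        /* Without this, LeakSanitizer reports the same pair of leaks
         * as above: the struct directly, the copied message indirectly. */
        error_free(stored);
        return 0;
    }
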
aik pushed a commit that referenced this issue May 15, 2018
Eric Auger reported the problem days ago that OOB broke ARM when running
with libvirt:

http://lists.gnu.org/archive/html/qemu-devel/2018-03/msg06231.html

The problem was that the monitor dispatcher bottom half was bound to
qemu_aio_context now, which could be polled unexpectedly in block code.
We should keep the dispatchers run in iohandler_ctx just like what we
did before the Out-Of-Band series (chardev uses qio, and qio binds
everything with iohandler_ctx).

Without this change, the QMP dispatcher might run even before we reach the
main loop in the block IO path, for example in a stack like this (the ARM
case, where the "cont" command handler runs during the machine init phase):

        #0  qmp_cont ()
        #1  0x00000000006bd210 in qmp_marshal_cont ()
        #2  0x0000000000ac05c4 in do_qmp_dispatch ()
        #3  0x0000000000ac07a0 in qmp_dispatch ()
        #4  0x0000000000472d60 in monitor_qmp_dispatch_one ()
        #5  0x000000000047302c in monitor_qmp_bh_dispatcher ()
        #6  0x0000000000acf374 in aio_bh_call ()
        #7  0x0000000000acf428 in aio_bh_poll ()
        #8  0x0000000000ad5110 in aio_poll ()
        #9  0x0000000000a08ab8 in blk_prw ()
        #10 0x0000000000a091c4 in blk_pread ()
        #11 0x0000000000734f94 in pflash_cfi01_realize ()
        #12 0x000000000075a3a4 in device_set_realized ()
        #13 0x00000000009a26cc in property_set_bool ()
        #14 0x00000000009a0a40 in object_property_set ()
        #15 0x00000000009a3a08 in object_property_set_qobject ()
        #16 0x00000000009a0c8c in object_property_set_bool ()
        #17 0x0000000000758f94 in qdev_init_nofail ()
        #18 0x000000000058e190 in create_one_flash ()
        #19 0x000000000058e2f4 in create_flash ()
        #20 0x00000000005902f0 in machvirt_init ()
        #21 0x00000000007635cc in machine_run_board_init ()
        #22 0x00000000006b135c in main ()

Actually the problem is more severe than that.  After we switched to the
qemu AIO handler, the monitor dispatcher code can even be called from a
nested aio_poll(), i.e. an explicit aio_poll() inside another main loop
aio_poll(), which could be racy too, breaking code like TPM and 9p that
use nested event loops.

Switch to use the iohandler_ctx for monitor dispatchers.

My sincere thanks to Eric Auger, who offered great help during both
debugging and verifying the problem.  The ARM test was carried out by
applying this patch on top of QEMU 2.12.0-rc0, and the problem is gone
after the patch.

A quick test of mine shows that with this patch applied we can pass all
raw iotests even with OOB on by default.

CC: Eric Blake <[email protected]>
CC: Markus Armbruster <[email protected]>
CC: Stefan Hajnoczi <[email protected]>
CC: Fam Zheng <[email protected]>
Reported-by: Eric Auger <[email protected]>
Tested-by: Eric Auger <[email protected]>
Signed-off-by: Peter Xu <[email protected]>
Message-Id: <[email protected]>
Reviewed-by: Eric Blake <[email protected]>
Reviewed-by: Stefan Hajnoczi <[email protected]>
Signed-off-by: Eric Blake <[email protected]>
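
The shape of the fix described above, as a fragment sketch (not the literal diff; aio_bh_new() and iohandler_get_aio_context() are existing QEMU APIs, and monitor_qmp_bh_dispatcher() is the dispatcher named in the stack):

    /* Before: the dispatcher bottom half was bound to the main AIO
     * context, so any nested aio_poll() -- e.g. the blk_pread() during
     * machine init in the stack above -- could fire it early. */
    bh = aio_bh_new(qemu_get_aio_context(),
                    monitor_qmp_bh_dispatcher, NULL);

    /* After: bound to the iohandler context, which is only polled from
     * the real main loop, restoring the pre-OOB dispatch behaviour. */
    bh = aio_bh_new(iohandler_get_aio_context(),
                    monitor_qmp_bh_dispatcher, NULL);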
@cdeadmin

------- Comment From [email protected] 2018-07-31 10:47:03 EDT-------
Bala, can you re-test this bug with the latest HostOS? It should already be fixed.

@cdeadmin

------- Comment From [email protected] 2018-08-31 13:20:06 EDT-------
No response in a long time, so just going to close.

@cdeadmin cdeadmin closed this as completed Sep 6, 2018