Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

rare crash in pthread_detach() on dropping grpcio::client::Client #653

Open
pdmorrow opened this issue Aug 26, 2024 · 1 comment
Open

rare crash in pthread_detach() on dropping grpcio::client::Client #653

pdmorrow opened this issue Aug 26, 2024 · 1 comment

Comments

@pdmorrow
Copy link

Describe the bug

When a grpcio::client::Client object went out of scope and each of the objects elements where dropped, I observed a crash when attempting to drop an "Environment" object:

(gdb) bt
#0  ___pthread_detach (th=139936879605440) at ./nptl/pthread_detach.c:48
#1  0x000055d69512f660 in core::ptr::drop_in_place<std::sys::pal::unix::thread::Thread> () at /usr/src/rustc-1.78.0/library/core/src/ptr/mod.rs:515
#2  core::ptr::drop_in_place<std::thread::JoinInner<()>> ()
    at /usr/src/rustc-1.78.0/library/core/src/ptr/mod.rs:515
#3  core::ptr::drop_in_place<std::thread::JoinHandle<()>> ()
    at /usr/src/rustc-1.78.0/library/core/src/ptr/mod.rs:515
#4  core::ptr::drop_in_place<[std::thread::JoinHandle<()>]> ()
    at /usr/src/rustc-1.78.0/library/core/src/ptr/mod.rs:515
#5  alloc::vec::{impl#24}::drop<std::thread::JoinHandle<()>, alloc::alloc::Global> (
    self=0x55d696f1e8a8) at /usr/src/rustc-1.78.0/library/alloc/src/vec/mod.rs:3205
#6  core::ptr::drop_in_place<alloc::vec::Vec<std::thread::JoinHandle<()>, alloc::alloc::Global>> () at /usr/src/rustc-1.78.0/library/core/src/ptr/mod.rs:515                  
#7  core::ptr::drop_in_place<grpcio::env::Environment> ()
    at /usr/src/rustc-1.78.0/library/core/src/ptr/mod.rs:515
#8  alloc::sync::Arc<grpcio::env::Environment, alloc::alloc::Global>::drop_slow<grpcio::env::Environment, alloc::alloc::Global> (self=<optimized out>)                        
    at /usr/src/rustc-1.78.0/library/alloc/src/sync.rs:1804
#9  0x000055d69512f726 in alloc::sync::{impl#33}::drop<grpcio::env::Environment, alloc::alloc::Global> (self=0x55d696f20130)                                                  
    at /usr/src/rustc-1.78.0/library/alloc/src/sync.rs:2459
#10 core::ptr::drop_in_place<alloc::sync::Arc<grpcio::env::Environment, alloc::alloc::Global>> () at /usr/src/rustc-1.78.0/library/core/src/ptr/mod.rs:515                    
#11 core::ptr::drop_in_place<grpcio::channel::ChannelInner> ()
    at /usr/src/rustc-1.78.0/library/core/src/ptr/mod.rs:515
#12 alloc::sync::Arc<grpcio::channel::ChannelInner, alloc::alloc::Global>::drop_slow<grpcio::channel::ChannelInner, alloc::alloc::Global> (self=<optimized out>)              
    at /usr/src/rustc-1.78.0/library/alloc/src/sync.rs:1804
#13 0x000055d695129a21 in alloc::sync::{impl#33}::drop<grpcio::channel::ChannelInner, alloc::alloc::Global> (self=0x7ffd4eb24cc8)                                             
    at /usr/src/rustc-1.78.0/library/alloc/src/sync.rs:2459
#14 core::ptr::drop_in_place<alloc::sync::Arc<grpcio::channel::ChannelInner, alloc::alloc::Global>> () at /usr/src/rustc-1.78.0/library/core/src/ptr/mod.rs:515               
#15 core::ptr::drop_in_place<grpcio::channel::Channel> ()
    at /usr/src/rustc-1.78.0/library/core/src/ptr/mod.rs:515
#16 core::ptr::drop_in_place<grpcio::client::Client> ()
    at /usr/src/rustc-1.78.0/library/core/src/ptr/mod.rs:515
#17 core::ptr::drop_in_place<onboarding::proto::onboarding_gw_service_grpc::OnboardingGwClient> () at /usr/src/rustc-1.78.0/library/core/src/ptr/mod.rs:515                   
#18 0x000055d69512a929 in onboarding_geolocation_client::send_geolocation_ping (
    interface_name=..., address=..., hostname=..., cert_chain_bytes=..., 
    client_key=0x7ffd4eb24c30, client_cert=<optimized out>)
    at src/bin/onboarding-geolocation-client.rs:104
#19 onboarding_geolocation_client::main ()
    at src/bin/onboarding-geolocation-client.rs:164
(gdb) 

In our application we have:

/// Construct a gRPC channel and send a geolocation message
fn send_geolocation_ping(
    interface_name: &str,
    address: SocketAddr,
    hostname: &str,
    cert_chain_bytes: &[u8],
    client_cert: &openssl::x509::X509,
    client_key: &onboarding::PrivateKey,
) -> Result<(), Error> {
    // 1. Make a gRPC channel (securely, if we have a root cert)

    let env = std::sync::Arc::new(grpcio::EnvBuilder::new().build());

    // We are connecting by IP address, but we must ensure the SSL certificate
    // contains the right hostname
    let mut builder = grpcio::ChannelBuilder::new(env).override_ssl_target(hostname);

    if !cert_chain_bytes.is_empty() {
        // Only set up an encrypted connection if we have the server-side certificates
        let creds = grpcio::ChannelCredentialsBuilder::new()
            .root_cert(cert_chain_bytes.to_owned())
            .cert(client_cert.to_pem()?, client_key.to_grpc_cert_bytes())
            .build();
        builder = builder.set_credentials(creds);
    }
    let ch = builder.connect(&address.to_string());

    // 2. Make an Onboarding Gateway Client that uses that gRPC channel

    let gw: onboarding::proto::onboarding_gw_service_grpc::OnboardingGwClient = onboarding::proto::onboarding_gw_service_grpc::OnboardingGwClient::new(ch);

    // 3. Build the request

    let mut req = onboarding::proto::onboarding_gw_messages::DiscoverGeoIPLocationRequest::new();
    req.set_interface_name(interface_name.to_owned());

    // 4. Send it (synchronously) and wait for the response

    let opt = grpcio::CallOption::default().timeout(ONBOARDING_GW_GRPC_TIMEOUT);
    let _rsp = gw
        .discover_geo_ip_location_opt(&req, opt)
        .map_err(onboarding::error::GrpcError::from)?;

    Ok(())
}

The routine is called from the processes main thread and usually the process exits having been able to perform the RPC. The Environment struct looks like:

pub struct Environment {
    cqs: Vec<CompletionQueue>,
    idx: AtomicUsize,
    _handles: Vec<JoinHandle<()>>,
}

So it seems on the "_handles" field being dropped we've ended up in this corrupted state. The crashing frame is:

(gdb) f 0
#0  ___pthread_detach (th=139936879605440) at ./nptl/pthread_detach.c:48
48      ./nptl/pthread_detach.c: No such file or directory.
(gdb) p *((struct pthread *) 139936879605440)
Cannot access memory at address 0x7f4597fff6c0
(gdb) 

To Reproduce

I've been unable to reproduce this at all sadly so understand if nothing much can be done.

Expected behavior

No crash.

System information

  • CPU architecture: x64_64
  • Distribution and kernel version: Debian 12, 6.1.0-24-amd64.
  • SELinux on?: No SELinux
  • Any other system details we should know?: The system has a total of 72 CPU cores, though only cores 0-11 are available to the process in question (via cpuaffinity).

Additional context

Seems very rare, we run this process every 30s on 1000s of machines and have never seen this crash other than once. Thus I don't expect much to happen, perhaps this is some horrendous memory corruption.

I did wonder if it's worth joining all the thread handles in the impl of the drop trait for Environment? C.F.

fn drop(&mut self) {

@BusyJay
Copy link
Member

BusyJay commented Aug 27, 2024

Seems more like a Rust bug to me. 🤔

How about reusing Environment like putting it to some global states? Repeatedly creating threads is not good for performance.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants
@pdmorrow @BusyJay and others