
Need config for RTX 3090 #270

Open
pal-stdr opened this issue Nov 30, 2022 · 0 comments
pal-stdr commented Nov 30, 2022

I have seen that Accel-Sim has added a config for the RTX 3070, but I was looking for a config for the RTX 3090. Is it possible to add this config? If not, could anyone please tell me how to change the required parameters in the RTX 3070 config to make it compatible with the RTX 3090?

FYI, I am not a specialist in GPU architecture; I just understand the basics. Currently I am running code that adds 2 vectors in parallel over 1000 array/vector elements. Using the RTX 3070 config, the simulator gives me ~5500 cycles, whereas the real HW (RTX 3090) gives me ~3400 cycles. That is roughly a 38% error relative to the simulated cycle count. The Accel-Sim paper claims a MAE of ~15%, whereas I am getting ~38%.

I also went through the official GPGPU-Sim documentation and changed some parameters here and there, but the simulated cycle count always remains unchanged.

So either I misunderstood the paper's claim, or my config is wrong, or I am thinking about the whole thing in the wrong way. Or everything could be right. I would be very grateful if anyone could show me what I am doing wrong.
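In case it helps the discussion, here is a first-pass sketch of the gpgpusim.config parameters I believe would need to change when going from the RTX 3070 to the RTX 3090, based on NVIDIA's published specs. I am not sure this is complete or correct; the parameter names are assumed from the shipped SM86 config, and the clock values are approximate:

```
# Sketch only, not a validated RTX 3090 config.
# RTX 3090 has 82 SMs (RTX 3070: 46)
-gpgpu_n_clusters 82
-gpgpu_n_cores_per_cluster 1

# 384-bit GDDR6X bus = 12 x 32-bit channels (RTX 3070: 256-bit)
-gpgpu_n_mem 12

# core : interconnect : L2 : DRAM clocks in MHz
# 1695 MHz boost clock; DRAM value approximated from 19.5 Gbps GDDR6X
-gpgpu_clock_domains 1695.0:1695.0:1695.0:1219.0

# The L2 size also differs (3070: 4 MB, 3090: 6 MB), which would mean
# adjusting the -gpgpu_cache:dl2 line as well.
```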

I am sharing the code that I am using:

#include <stdio.h>
#include <cuda_runtime.h>

#define N 1000


__global__ void vector_add_cuda(float *out, float *a, float *b, int n) {

    int id = blockDim.x * blockIdx.x + threadIdx.x;

    // Guard against threads past the end of the array
    // (use the kernel parameter n, not the host-side macro N)
    if (id < n)
    {
        out[id] = a[id] + b[id];
    }
}

int main(){

    // Number of bytes to allocate for N floats
    size_t bytes = N * sizeof(float);

    float *h_a, *h_b, *h_out, *d_a, *d_b, *d_out;

    // Allocate pinned host memory
    cudaMallocHost(&h_a, bytes);
    cudaMallocHost(&h_b, bytes);
    cudaMallocHost(&h_out, bytes);

    // Initialize arrays
    for(int i = 0; i < N; i++){
        h_a[i] = 1.0f;
        h_b[i] = 2.0f;
    }

    // Allocate device memory for d_a, d_b, d_out
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);
    cudaMalloc(&d_out, bytes);


    // Transfer data from host to device memory
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);


    int THREADS = 256;
    int BLOCKS = (N + THREADS - 1) / THREADS;  // round up so every element is covered


    // Launch the kernel
    vector_add_cuda <<< BLOCKS, THREADS >>> (d_out, d_a, d_b, N);


    // Transfer computed data from device to host memory
    cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost);


    // Check/validate results: every element should be 1.0f + 2.0f = 3.0f
    for(int i = 0; i < N; i++)
    {
        if (i < 10)
        {
            printf("h_a[%d] = %.5f, h_out[%d] = %.5f\n", i, h_a[i], i, h_out[i]);
        }

        if (h_out[i] != 3.0f)
        {
            printf(" Error at %d, h_out[%d] = %.5f, expected 3.00000\n", i, i, h_out[i]);
            break;
        }

    }

    // Cleanup: device memory with cudaFree, pinned host memory with cudaFreeHost
    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_out);

    cudaFreeHost(h_a);
    cudaFreeHost(h_b);
    cudaFreeHost(h_out);

    return 0;
}