Home
Welcome to the NielsenDSRS wiki!
More information about the system configuration:
Choosing the number of workers and threads depends on several factors, including your system's hardware specifications and the nature of your data processing tasks. Here's a guide to help you decide:
Number of Workers:
Generally, set the number of workers equal to the number of physical CPU cores available on your machine. For example, if you have a 4-core CPU, you might start with 4 workers. If your system has hyperthreading, you might consider using up to the number of logical cores, but be aware that this doesn't always improve performance and can sometimes lead to resource contention.
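As a quick sketch of how to check this (psutil reports both counts; note that cpu_count(logical=False) may return None on some platforms):

import psutil

physical = psutil.cpu_count(logical=False)  # physical cores; may be None on some platforms
logical = psutil.cpu_count(logical=True)    # logical cores (includes hyperthreads)
print(f"Physical cores: {physical}, logical cores: {logical}")

# Conservative default: one worker per physical core,
# falling back to logical cores if the physical count is unknown
n_workers = physical or logical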
Threads per Worker:
In most cases, it's recommended to set threads_per_worker=1, as in the example at the bottom of this page. For workloads dominated by pure-Python code, Python's GIL prevents threads within a single worker process from running in parallel, so extra threads mostly add overhead; workloads that spend their time in GIL-releasing libraries such as NumPy or pandas can benefit from more threads per worker.
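A minimal sketch of the two extremes, assuming a hypothetical 4-core machine; the numbers are illustrative, not recommendations:

from dask.distributed import Client

# Mostly pure-Python tasks: several single-threaded worker processes
# sidestep the GIL entirely
client = Client(n_workers=4, threads_per_worker=1)
client.close()

# Mostly NumPy/pandas tasks (which release the GIL): fewer processes
# with more threads each, reducing inter-process data transfer
client = Client(n_workers=1, threads_per_worker=4)
client.close()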
Memory per Worker:
Dividing total system memory by the number of workers is a good starting point. However, make sure to leave some memory for the operating system and other processes: a good rule of thumb is to give Dask about 75-80% of total system memory.
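A quick back-of-the-envelope version of that rule, with 32 GB of RAM and 4 workers assumed purely for illustration:

total_memory_gb = 32   # assumed machine total, for illustration
n_workers = 4          # one per physical core on this hypothetical box
memory_per_worker_gb = 0.75 * total_memory_gb / n_workers
print(memory_per_worker_gb)  # -> 6.0 GB per worker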
Considerations:
- I/O-bound vs. CPU-bound tasks: if your tasks are I/O-bound (e.g., reading/writing files), you might benefit from more workers than cores; if they're CPU-bound, match the number of workers to your CPU cores (see the sketch after this list).
- Memory usage: if your tasks require a lot of memory, you may need to reduce the number of workers to give each one more memory.
- Network communication: more workers increase inter-worker communication overhead, so there's a balance to strike.
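A hedged illustration of the first trade-off, again assuming a hypothetical 4-core machine:

from dask.distributed import Client

# CPU-bound work: match workers to physical cores
client = Client(n_workers=4, threads_per_worker=1)
client.close()

# I/O-bound work: modest oversubscription can keep the CPU busy
# while some workers wait on reads/writes
client = Client(n_workers=8, threads_per_worker=1)
client.close()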
Experimentation:
The optimal configuration often requires some experimentation. Start with a number of workers equal to your CPU cores and adjust based on performance, using Dask's diagnostic tools to monitor performance and resource usage.
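One way to do that monitoring, sketched with dask.distributed's dashboard_link and performance_report; the array computation is just a placeholder workload:

import dask.array as da
from dask.distributed import Client, performance_report

client = Client(n_workers=4, threads_per_worker=1)
print(client.dashboard_link)  # live dashboard: task stream, memory, CPU

# Record an HTML profile of a sample workload for offline inspection
with performance_report(filename="dask-report.html"):
    x = da.random.random((20_000, 20_000), chunks=(2_000, 2_000))
    x.mean().compute()

client.close()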
import psutil
from dask.distributed import Client

# Get system information
total_memory_gb = psutil.virtual_memory().total / (1024**3)  # Total RAM in GB
cpu_cores = psutil.cpu_count(logical=False) or psutil.cpu_count()  # Physical cores, falling back to logical

# Calculate workers and memory, reserving ~25% of RAM for the OS and other processes
n_workers = cpu_cores
memory_per_worker_gb = int(0.75 * total_memory_gb / n_workers)  # Using 75% of total RAM

# Start the client
client = Client(n_workers=n_workers, threads_per_worker=1,
                memory_limit=f'{memory_per_worker_gb}GB')

print(client)
print(f"Using {n_workers} workers with {memory_per_worker_gb}GB per worker")
This script uses the psutil library to detect system resources automatically and set up the Dask client accordingly. You can then adjust these values based on your specific use case and performance observations.
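As a sanity check, you can ask the scheduler what it actually allocated; this sketch assumes the client created by the script above is still running:

# Inspect each worker's thread count and memory limit as the scheduler sees them
for addr, worker in client.scheduler_info()["workers"].items():
    print(addr, worker["nthreads"], f'{worker["memory_limit"] / 1024**3:.1f} GB')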