Fixes for Mac M3 #2830

Draft: wants to merge 10 commits into base: master

@JoeyOverby JoeyOverby commented Sep 16, 2024

This is really just a POC - and not a polished PR. This is mostly an attempt to just get this working and then hopefully the owner of the original repo will grant me the ability to put a PR into that one so others can help clean up the code.

My only focus of this was to get it to work for Textual Inversion training - I don't know if it works for the other functionality.

(IMPORTANT: I have a PR for the sd-scripts here, but I'm VERY unsure if it's correct... will update later!)

Fixes

  • Instead of trying to open a window to ask if overwriting an embedding model file is ok, simply back it up
  • Add ability to use Mac MPS as a device when it's available
  • Fix packages/versions to work with MPS setup
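The first fix (backing up an existing embedding file instead of prompting) could look roughly like the sketch below. This is not the code from the PR; the function name `backup_existing_file` and the timestamped `.bak` naming scheme are assumptions made for illustration.

```python
import shutil
import time
from pathlib import Path
from typing import Optional


def backup_existing_file(path: str) -> Optional[Path]:
    """Copy `path` to a timestamped sibling file if it already exists.

    Returns the backup path, or None when there was nothing to back up.
    Hypothetical helper: the naming scheme is an assumption, not the PR's.
    """
    source = Path(path)
    if not source.is_file():
        return None
    stamp = time.strftime("%Y%m%d-%H%M%S")
    # e.g. MyTrainedModel.safetensors -> MyTrainedModel.20240916-101500.bak.safetensors
    backup = source.with_name(f"{source.stem}.{stamp}.bak{source.suffix}")
    shutil.copy2(source, backup)  # copy2 also preserves file metadata
    return backup
```

Calling this right before the training script writes its output keeps the previous embedding around without needing a GUI dialog, which is the behavior the fix describes.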

Notes

A lot of manual steps were needed while trying to get the packages working. I will try a clean install later, but for now I'm putting in the notes I took while doing this in case someone else wants to do this as well.

  • Had to remove tensorboard completely. I wasn't able to get the numpy versions to work with both (so I run tensorboard separately in a different venv).
  • Had to remove tensorflow (which wasn't needed for the Textual Inversion training I was doing).
  • I'd recommend removing (or backing up/renaming) your venv folder, so you don't have to run the uninstall steps below

Full List of Installed Packages

I'll clean up the install scripts (and requirements files) later, but for now I wanted to give everyone the packages and versions that worked for me!


Package                      Version     Editable project location
---------------------------- ----------- -------------------------------------
accelerate                   0.25.0
aiohttp                      3.10.5
altair                       4.2.2
astunparse                   1.6.3
bitsandbytes                 0.41.1
blendmodes                   2022
dadaptation                  3.1
easygui                      0.98.3
fairscale                    0.4.13
gast                         0.6.0
google-pasta                 0.2.0
gradio                       4.43.0
h5py                         3.11.0
imagesize                    1.4.1
invisible-watermark          0.2.0
keras                        2.14.0
libclang                     18.1.1
library                      0.0.0       
lion-pytorch                 0.0.6
lycoris_lora                 2.2.0.post3
ml-dtypes                    0.2.0
numba                        0.59.1
omegaconf                    2.3.0
onnx                         1.16.1
onnxruntime                  1.17.1
open-clip-torch              2.20.0
opt-einsum                   3.3.0
pip                          24.2
prodigyopt                   1.0
pytorch-lightning            2.0.0
scipy                        1.11.4
tensorboard                  2.14.1
tensorflow-io-gcs-filesystem 0.37.1
termcolor                    2.4.0
tk                           0.1.0
torchaudio                   2.4.1
voluptuous                   0.13.1
wandb                        0.15.11
wrapt                        1.14.1

Manual commands I ran to get to this point

pip uninstall open-clip-torch
pip uninstall tensorflow-macos tensorflow-metal tensorflow-estimator -y
pip install --force-reinstall torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/metal.html
pip install numpy==1.26.0 --force-reinstall
pip install Pillow==9.5.0 --force-reinstall
pip install blendmodes==2022 numba==0.59.1 scipy==1.11.4 --force-reinstall

Verify MPS Installed Correctly

python -c "import torch; print(torch.backends.mps.is_available())"
python -c "import numpy; print(numpy.__version__)"
python -c "import PIL; print(PIL.__version__)"
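The availability check above can be folded into a small device-selection helper. The sketch below is only an illustration of the "use MPS when it's available" idea, not the code in this PR; `pick_device` and `detect_device` are hypothetical names, and preferring CUDA over MPS is an assumption.

```python
try:
    import torch
except ImportError:  # lets the sketch run on machines without PyTorch
    torch = None


def pick_device(cuda_available: bool, mps_available: bool) -> str:
    """Return a device string, preferring CUDA, then MPS, then CPU."""
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"


def detect_device() -> str:
    """Query PyTorch for available backends and pick one."""
    if torch is None:
        return "cpu"
    # Older PyTorch builds may not expose torch.backends.mps at all.
    mps_backend = getattr(torch.backends, "mps", None)
    mps_ok = mps_backend is not None and mps_backend.is_available()
    return pick_device(torch.cuda.is_available(), mps_ok)


if __name__ == "__main__":
    print(detect_device())
```

Keeping the decision in a pure function (`pick_device`) makes it easy to check the fallback order without needing a Mac.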

Successful Config

This is a copy of my successful config for running Textual Inversion training (obviously, fill in your own paths and names).
And by successful, I mean it ran, not that it did a great job. I'm still adjusting the parameters, but hopefully this gives you all of the settings you need to at least run (figuring out things like float and AdamW took me a while).

bucket_reso_steps = 64
cache_latents = true
cache_latents_to_disk = true
caption_extension = ".txt"
clip_skip = 1
dynamo_backend = "no"
enable_bucket = true
gradient_accumulation_steps = 2
gradient_checkpointing = true
huber_c = 0.1
huber_schedule = "snr"
init_word = "woman"
learning_rate = 5e-6
logging_dir = "<REPO_PATH>/kohya_ss/outputs/<PATH TO YOUR TRAINING DIR>/log"
loss_type = "l2"
lr_scheduler = "cosine"
lr_scheduler_args = []
lr_scheduler_num_cycles = 1
lr_scheduler_power = 1
lr_warmup_steps = 10
max_bucket_reso = 1024
max_data_loader_n_workers = 8
max_timestep = 1000
max_token_length = 150
max_train_steps = 100
min_bucket_reso = 512
min_snr_gamma = 5
mixed_precision = "no"
multires_noise_discount = 0.3
no_half_vae = true
noise_offset_type = "Original"
num_vectors_per_token = 12
optimizer_args = []
optimizer_type = "AdamW"
output_dir = "<REPO_PATH>/kohya_ss/outputs/<PATH TO YOUR TRAINING DIR>/model"
output_name = "MyTrainedModel"
pretrained_model_name_or_path = "<PATH TO YOUR TRAINING CHECKPOINT>.safetensors"
prior_loss_weight = 1
resolution = "1024,1024"
resume = "<REPO_PATH>/kohya_ss/outputs/<PATH TO YOUR TRAINING DIR>/model/<PREVIOUS MODEL>"
sample_every_n_epochs = 1
sample_prompts = "<REPO_PATH>/kohya_ss/outputs/<PATH TO YOUR TRAINING DIR>/model/prompt.txt"
sample_sampler = "euler_a"
save_every_n_epochs = 1
save_every_n_steps = 20
save_last_n_steps = 15
save_last_n_steps_state = 15
save_model_as = "safetensors"
save_precision = "float"
save_state = true
save_state_on_train_end = true
sdpa = true
token_string = "YOUR_TOKEN_STRING_HERE"
train_batch_size = 6
train_data_dir = "<REPO_PATH>/repos/kohya_ss/outputs/<PATH TO YOUR TRAINING DIR>/img"
use_object_template = true
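A quick way to compare another config against the one above is to flag the handful of settings that differ. The sketch below is an assumption-heavy illustration: the author only names "float" and AdamW as hard to figure out, so treating `mixed_precision` and `no_half_vae` as MPS-relevant is a guess, and `find_deviations` is a hypothetical helper.

```python
# Values copied from the config above. Which of them actually matter on MPS
# is an assumption of this sketch, not something the PR confirms.
KNOWN_WORKING = {
    "mixed_precision": "no",
    "save_precision": "float",
    "optimizer_type": "AdamW",
    "no_half_vae": True,
}


def find_deviations(config: dict) -> dict:
    """Return {key: (known_working_value, actual_value)} for differing keys."""
    deviations = {}
    for key, expected in KNOWN_WORKING.items():
        actual = config.get(key)  # None when the key is missing
        if actual != expected:
            deviations[key] = (expected, actual)
    return deviations


if __name__ == "__main__":
    candidate = dict(KNOWN_WORKING, mixed_precision="fp16")
    print(find_deviations(candidate))
```

On Python 3.11 or newer, the TOML file itself can be read into a dict with the standard-library `tomllib.load` (opened in binary mode) before passing it to the helper.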

@bmaltais
Owner

Please ensure that none of the changes you submit will introduce issues to the current solution. I see that your changes are marked as a draft, and I understand there's still work to be done.

I appreciate you taking the time to address the MacOS situation, especially since it's an area that hasn’t been looked after in quite some time. However, as I don't own an M3 Mac, I haven't had the opportunity to focus on it myself.

One critical point to keep in mind is to avoid introducing any problems for other users, which could block your code from being merged. For example, there are a significant number of changes in the common requirements.txt file, and I believe this could cause major issues for Linux and Windows users. Perhaps try using a dedicated requirements-macos-m3.txt file where everything is as it should be? This might require changes to how the setup.py script is run, as it was never intended to support so many variations.

Therefore, I recommend minimizing modifications to the requirements.txt file and instead focusing on creating a requirements-macos-m3.txt file that ensures compatibility with M3 Macs (and possibly M1 and M2 as well).

Regarding the submodule changes, those won’t be accepted. Proper support for M3 should be addressed in the kohya_ss sd-scripts upstream. I won’t approve pulling submodule changes from other sources, as this could introduce concerns for both current and future users.

I’m hopeful we can find a solution that provides proper MacOS support without disrupting the experience for Linux and Windows users.
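The dedicated requirements file suggested above could be selected by platform, roughly as in the sketch below. It is not how the existing setup script works; `select_requirements_file` is a hypothetical helper, and only the two file names come from this thread.

```python
import platform

GENERIC_REQUIREMENTS = "requirements.txt"
MACOS_ARM_REQUIREMENTS = "requirements-macos-m3.txt"


def select_requirements_file(system: str, machine: str) -> str:
    """Pick the Apple Silicon requirements file on arm64 macOS, else the generic one."""
    if system == "Darwin" and machine == "arm64":
        return MACOS_ARM_REQUIREMENTS
    return GENERIC_REQUIREMENTS


if __name__ == "__main__":
    # platform.machine() reports "arm64" on Apple Silicon when Python runs natively.
    print(select_requirements_file(platform.system(), platform.machine()))
```

Because the check keys on the architecture rather than the chip generation, the same file would also be picked on M1 and M2 machines, which matches the "possibly M1 and M2 as well" remark.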

@JoeyOverby
Author

I'm sorry for the delay - apparently my notifications went to an old work email that I no longer have.

I actually think that we don't need to make any changes to the sd-scripts submodule. I'll try to test that here in a bit. My only concern is what happens with conflicting versions between the mac requirements files and the generic requirements file?

Would it make more sense to have just one file for mac and then not reference the generic requirements one in the mac setups?

And happy to help! Thank you for taking the time to respond. I appreciate it.
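The concern about conflicting versions between the two requirements files can be checked mechanically. The sketch below compares exact `==` pins only; `parse_pins` and `find_conflicts` are hypothetical helpers, the name normalization is simplified, and lines with extras or other specifiers (`>=`, `~=`) are ignored.

```python
import re

# Matches simple "name==version" pins; stops the version at whitespace, ';' or '#'.
PIN_PATTERN = re.compile(r"^\s*([A-Za-z0-9_.\-]+)\s*==\s*([^\s;#]+)")


def parse_pins(text: str) -> dict:
    """Return {normalized_name: version} for every exact pin in a requirements file."""
    pins = {}
    for line in text.splitlines():
        match = PIN_PATTERN.match(line)
        if match:
            # Simplified normalization: lowercase, underscores to hyphens.
            name = match.group(1).lower().replace("_", "-")
            pins[name] = match.group(2)
    return pins


def find_conflicts(generic_text: str, mac_text: str) -> dict:
    """Return {name: (generic_version, mac_version)} where both files pin differently."""
    generic = parse_pins(generic_text)
    mac = parse_pins(mac_text)
    shared = sorted(generic.keys() & mac.keys())
    return {name: (generic[name], mac[name]) for name in shared if generic[name] != mac[name]}
```

If the Mac file is used on its own rather than layered on top of the generic one, an empty result is not required, but the report still shows where the two setups have drifted apart.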

@bmaltais
Owner

I believe the separate requirements file for Mac is the best approach. If my memory serves me correctly, I think it’s possible to achieve this via a parameter. You can build your solution around this concept.

@JoeyOverby
Author

JoeyOverby commented Sep 20, 2024 via email

@bmaltais
Owner

I will be away for a week so no rush. I will not be able to work on the GUI for quite a bit.

@bghira

bghira commented Sep 22, 2024

mps has correctness issues and can't be relied on for training a model. however MLX or Tinygrad do not rely on MPS and have proper results. i've never seen good results from training on mps, and i've supported it in simpletuner since january.
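One rough way to probe the correctness concern on a given machine is to run the same computation on CPU and on MPS and compare the results. The sketch below is only a spot check under stated assumptions: the choice of matmul plus softmax is arbitrary, small differences are expected from ordinary float32 rounding, and a small difference here says nothing about whether a full training run behaves correctly. `max_abs_diff_cpu_vs_mps` is a hypothetical helper.

```python
try:
    import torch
except ImportError:  # lets the sketch run on machines without PyTorch
    torch = None


def max_abs_diff_cpu_vs_mps(size: int = 256, seed: int = 0):
    """Run matmul + softmax on CPU and MPS and return the largest absolute difference.

    Returns None when PyTorch or the MPS backend is unavailable.
    """
    if torch is None:
        return None
    mps_backend = getattr(torch.backends, "mps", None)
    if mps_backend is None or not mps_backend.is_available():
        return None
    torch.manual_seed(seed)
    a = torch.randn(size, size)
    b = torch.randn(size, size)
    cpu_out = torch.softmax(a @ b, dim=-1)
    # Same inputs, moved to the MPS device, with the result copied back to CPU.
    mps_out = torch.softmax(a.to("mps") @ b.to("mps"), dim=-1).cpu()
    return (cpu_out - mps_out).abs().max().item()


if __name__ == "__main__":
    print(max_abs_diff_cpu_vs_mps())
```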

@JoeyOverby
Author

JoeyOverby commented Sep 22, 2024 via email

@bghira

bghira commented Sep 30, 2024

the problem is probably an overflow inside pytorch's MPS code that has yet to be discovered. if you go to the pytorch issue tracker and search for label:mps is:open you will see the problem.

the only reason to support MPS for pytorch in this repo (or the original) is for the maintainer of the repo to be able to directly run the code on their apple development workstation. this is the only reason that i have it supported in simpletuner.

apple machines have upsides for ML development:

  • they are very power-efficient. i live in a country without a very reliable power grid, and we use solar.
  • they have extremely fast CPUs that can, e.g., quantise weights or perform image transforms faster than the highest-end Intel or AMD parts
  • the CPU mode in pytorch is correct when MPS is not, and pytorch on Apple M3 CPU is surprisingly fast

they have downsides:

  • the code required to support Apple systems often comes at a detriment to the entire codebase
  • for example you cannot rely on the existence of things like autocast or CUDA streams
  • torch compile encounters branching problems when you have to check for MPS systems (example, the Flux RoPE code uses float64 on NVIDIA but fp64 isn't available on MPS so it falls back to fp32, which torch compile gets confused and unhappy about)
  • dtype handling is different between the two platforms, where you will oddly encounter situations where the same code runs improperly on one vs the other
    • torchao / quanto (quantisation) introduce more problems here
    • CUDA seems fine with mixing bf16 and fp32 compute sometimes while MPS is never happy with this situation
  • CUDA extensions and custom kernels don't work on MPS and you'll be frustrated to discover just how much of the ecosystem relies on these things
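The float64 point in the list above is the kind of per-device branch that ends up scattered through a codebase. A minimal sketch of such a branch follows; it is not the Flux RoPE code, and `high_precision_dtype_name` is a hypothetical helper that returns a dtype name rather than a `torch.dtype` so it runs without PyTorch installed.

```python
def high_precision_dtype_name(device_type: str) -> str:
    """Return the widest float dtype name to request on a device.

    MPS has no float64 support, so it falls back to float32 there.
    """
    if device_type == "mps":
        return "float32"
    return "float64"


if __name__ == "__main__":
    for device_type in ("cuda", "cpu", "mps"):
        print(device_type, high_precision_dtype_name(device_type))
```

With PyTorch available, the name can be turned into a dtype via `getattr(torch, high_precision_dtype_name(device.type))`.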

it can be almost a part-time job to keep MPS and CUDA working together, and in the case of pytorch, it's actually several full-time jobs on their end.

if bmaltais or kohya_tech personally never have to run on MPS then i would just run as far and as fast as possible in the opposite direction and never touch that stack. Apple users are better-served by an architecture-specific training framework, if one even exists.

@cchance27

> for example you cannot rely on the existence of things like autocast or CUDA streams

I believe nightly is adding AMP for MPS now.

> Apple users are better-served by an architecture-specific training framework, if one even exists.

Not gonna lie i wish someone would work on an sd training script on mlx :S

4 participants