Convert MDS checkpoint to HF Llama model #394

Open
wants to merge 322 commits into
base: main
82555c8
Update `set_params.sh`
saforem2 Feb 5, 2024
9dc3482
Update `pretrain_gpt_alcf.py`
saforem2 Feb 5, 2024
41208b4
Update `.gitignore`
saforem2 Feb 6, 2024
42f5eb9
Update `set_params.sh`
saforem2 Feb 7, 2024
8b6d8bf
adding master_addr in case it doesn
zhenghh04 Feb 8, 2024
9030d77
added socket
zhenghh04 Feb 8, 2024
7797c2f
Update `generate_config.sh`
saforem2 Feb 9, 2024
d1367d4
Update `set_params.sh,generate_config.sh`
saforem2 Feb 11, 2024
615faa4
Add `train_llama_alcf_chunk_00_of_20.sh`
saforem2 Feb 13, 2024
ce5d896
added profile as an option
zhenghh04 Feb 14, 2024
0395b2c
merge recent changes
zhenghh04 Feb 14, 2024
8280d01
Update `train_llama_alcf_chunk-00-of-20.sh`
saforem2 Feb 19, 2024
63385a8
Update `megatron/data/blendable_dataset.py`
saforem2 Feb 19, 2024
74e403f
Update `set_params.sh`
saforem2 Feb 19, 2024
cc6c880
Update `megatron/training.py`
saforem2 Feb 19, 2024
ead0bc6
Update `megatron/data/indexed_dataset.py`
saforem2 Feb 19, 2024
9cde45a
Update `megatron/data/gpt_dataset.py`
saforem2 Feb 19, 2024
bffa9f8
Update `pretrain_gpt_alcf.py`
saforem2 Feb 21, 2024
0a42723
Update `set_params.sh`
saforem2 Feb 21, 2024
2bf2083
Add `train_llama_alcf_polaris.sh`
saforem2 Feb 21, 2024
346ddc3
Update `train_llama_alcf_polaris.sh`
saforem2 Feb 21, 2024
bf53655
Remove `train_llama_alcf_chunk_00-of-20.sh`
saforem2 Feb 22, 2024
28ba58f
Add `train_llama_alcf_polaris.sh`
saforem2 Feb 22, 2024
7f9af8d
Update `train_llama_alcf_polaris.sh`
saforem2 Feb 23, 2024
7b00ac4
Update `train_llama_alcf_polaris.sh`
saforem2 Feb 23, 2024
9a3688f
Add `helpers_alcf.sh`
saforem2 Feb 23, 2024
c5cb6bf
Merge pull request #4 from zhenghh04/main
saforem2 Feb 24, 2024
3fe1606
Merge branch 'microsoft:main' into main
saforem2 Feb 24, 2024
5cfa556
removed unnecessary setup for master_port
zhenghh04 Feb 24, 2024
43629e6
Update `helpers_alcf.sh` for Aurora
saforem2 Feb 24, 2024
5a3dae7
Fix `unable to init args` in `pretrain_gpt_alcf.py`
saforem2 Feb 24, 2024
d44769c
Update `train_llama_alcf_polaris.sh`
saforem2 Feb 25, 2024
d3a2972
Update `pretrain_gpt_alcf.py`
saforem2 Feb 25, 2024
3b6b94a
Update `megatron/training.py`
saforem2 Feb 25, 2024
4628b9f
Add `train_llama_alcf_aurora.sh`
saforem2 Feb 25, 2024
ef1e83e
Renamed `llama_alcf.sh -> train_llama_alcf_polaris_hzheng.sh`
saforem2 Feb 25, 2024
4556931
Merge branch 'microsoft:main' into main
saforem2 Feb 26, 2024
3824801
Merge branch 'microsoft:main' into main
saforem2 Feb 26, 2024
34f72d7
add util files
Feb 26, 2024
d3f4a92
Merge branch 'microsoft:main' into main
saforem2 Feb 28, 2024
5a0fa30
Add `ALCF_utils/{test_blend.sh,test_blendable_dataset.py}`
saforem2 Feb 28, 2024
a4a08c9
Update `train_llama_alcf_aurora.sh`
saforem2 Feb 28, 2024
0daba44
Update `train_llama_alcf_polaris.sh`
saforem2 Feb 28, 2024
62ef3c5
Remove old `set_params.sh`
saforem2 Feb 28, 2024
6676252
Update `helpers_alcf.sh`
saforem2 Feb 28, 2024
747b568
Update `.gitignore`
saforem2 Feb 28, 2024
25358b7
fixed int8 issue
zhenghh04 Feb 28, 2024
21e880a
pull recent changes
zhenghh04 Feb 28, 2024
17f5b64
Merge pull request #5 from zhenghh04/main
saforem2 Feb 28, 2024
45a3428
Move `helpers_alcf.sh -> ALCF_utils/helpers_alcf.sh`
saforem2 Feb 28, 2024
a907b47
modifying testing dataset
zhenghh04 Feb 28, 2024
9435b63
Update `train_llama_alcf_aurora.sh`
saforem2 Feb 28, 2024
fdf1904
Set `skip_warmup=True` in `pretrain_gpt_alcf.py`
saforem2 Feb 28, 2024
802a6e8
test_blend_full
zhenghh04 Feb 28, 2024
6b33b81
changed multiprocessing context
zhenghh04 Mar 1, 2024
14f847c
Merge branch 'argonne-lcf:main' into main
zhenghh04 Mar 1, 2024
932f4ac
Merge pull request #6 from zhenghh04/main
saforem2 Mar 1, 2024
ea9f6e3
Update Aurora qsub scripts
saforem2 Mar 4, 2024
67914f3
Update `megatron/data/data_samplers.py`
saforem2 Mar 5, 2024
9f84633
Add `ALCF_utils/data_file_list_polaris.txt`
saforem2 Mar 7, 2024
6bcac4e
Update `train_llama_alcf_polaris.sh`
saforem2 Mar 7, 2024
e7d76d3
Move `ALCF_utils/*` to `ALCF/*`
saforem2 Mar 7, 2024
7e6e263
Add (new) `ALCF/README.md`
saforem2 Mar 7, 2024
7133eb6
Update `train_llama_alcf_polaris.sh`
saforem2 Mar 7, 2024
eb89ee4
Update `ALCF/README.md`
saforem2 Mar 7, 2024
4076cf0
Update `ALCF/README.md`
saforem2 Mar 7, 2024
7e8a1a8
Update `train_llama_alcf_polaris.sh`
saforem2 Mar 7, 2024
13adb2e
Update `ALCF/helpers.sh`
saforem2 Mar 7, 2024
509e8ec
Update `ALCF/README.md`
saforem2 Mar 7, 2024
c272b4d
Update `ALCF/README.md`
saforem2 Mar 7, 2024
d9fc6e2
Update `ALCF/README.md`
saforem2 Mar 8, 2024
e21451e
Update `train_llama_alcf_polaris.sh`
saforem2 Mar 8, 2024
e332793
Update `ALCF/helpers.sh`
saforem2 Mar 8, 2024
a57a21f
Fix checkpointing issue with `torch=2.2.1` in `megatron/model/gpt_mod…
saforem2 Mar 8, 2024
fb7eaf7
Update README.md
saforem2 Mar 9, 2024
99d12f4
added support for multiprocessing_context
zhenghh04 Mar 12, 2024
d91b237
changed script to common environement
zhenghh04 Mar 12, 2024
7df1664
Update README.md
saforem2 Mar 12, 2024
9a95444
Pull in changes from `microsoft/Megatron-DeepSpeed@df0e2e4`
saforem2 Mar 13, 2024
9a2a739
Merge branch 'microsoft:main' into main
saforem2 Mar 13, 2024
79c3067
Update README.md
venkat-1 Mar 13, 2024
15d422e
fixing some NCCL issue and updated the script with the common environ…
zhenghh04 Mar 15, 2024
26598a1
Merge branch 'argonne-lcf:main' into main
zhenghh04 Mar 15, 2024
73dc82a
further allreduce check
zhenghh04 Mar 15, 2024
e039828
Merge branch 'main' of github.com:zhenghh04/Megatron-DeepSpeed
zhenghh04 Mar 15, 2024
4fe87d0
Update gpt_dataset.py
zhenghh04 Mar 15, 2024
39be395
Update blendable_dataset.py
zhenghh04 Mar 15, 2024
b0c2133
Update `ALCF/helpers.sh` for SunSpot
saforem2 Mar 15, 2024
e27f381
Add `train_llama_alcf_sunspot.sh`
saforem2 Mar 15, 2024
7bfef8f
Update `ALCF/helpers.sh`
saforem2 Mar 16, 2024
9c6b893
Update `train_llama_alcf_sunspot.sh`
saforem2 Mar 16, 2024
41a23f1
Add `ALCF/tokenizer.model`
saforem2 Mar 16, 2024
b95f96e
Add `generate_config_cpu_optimizer.sh`
saforem2 Mar 16, 2024
2049298
Catch `CPU_OPTIMIZER` in `ALCF/helpers.sh` @ `setParams()`
saforem2 Mar 16, 2024
f44f61a
Update `train_llama_alcf_polaris.sh`
saforem2 Mar 16, 2024
0527c71
Update `ALCF/helpers.sh`
saforem2 Mar 18, 2024
cb54fed
Update `train_llama_alcf_polaris.sh`
saforem2 Mar 18, 2024
b4b101d
Update `train_llama_alcf_sunspot.sh`
saforem2 Mar 18, 2024
3153825
Update README.md
saforem2 Mar 18, 2024
2d46039
Add `--train-iters-to-skip` option for skipping backprop on certain t…
saforem2 Mar 22, 2024
d96a585
[fix] `megatron/training.py`
saforem2 Mar 22, 2024
038109a
Add `--optimizer adamw` for `torch.optim.AdamW`
saforem2 Mar 26, 2024
66401b8
Update `pretrain_gpt_alcf.py`
saforem2 Mar 26, 2024
c87ea6a
Remove redundant `generate_config_cpu_optimizer.sh`
saforem2 Mar 26, 2024
37f6d3a
Update `train_llama_alcf_sunspot.sh`
saforem2 Mar 26, 2024
7befd20
Update `megatron/core/pipeline_parallel/schedules.py`
saforem2 Mar 26, 2024
25353fd
Update `ALCF/helpers.sh`
saforem2 Mar 26, 2024
29f1e30
Update `generate_config.sh`
saforem2 Mar 26, 2024
3363325
`assert args is not None` in `megatron/training.py`
saforem2 Mar 26, 2024
0dcba4f
Remove `"scheduler": {...}` from `generate_config.sh`
saforem2 Mar 26, 2024
367e4ae
Track optimizer states
saforem2 Mar 27, 2024
31644f0
Track optimizer states with W&B in `megatron/training.py`
saforem2 Mar 27, 2024
5273630
Update `pretrain_gpt_alcf.py`
saforem2 Mar 27, 2024
b4a310a
Update `train_llama_alcf_sunspot.sh`
saforem2 Mar 27, 2024
4f7ee53
Update `megatron/training.py`
saforem2 Mar 27, 2024
2f9cf05
Update `train_llama_alcf_{polaris,sunspot}.sh`
saforem2 Mar 27, 2024
5b9ad9a
Add support for `--optimizer={apex.adam,apex.sgd,adamw,adam,sgd}`
saforem2 Apr 2, 2024
294d81f
Add support for \`--optimizer={apex.adam,apex.sgd,adamw,adam,sgd}\`
saforem2 Apr 2, 2024
5c61fa5
Merge pull request #7 from zhenghh04/main
saforem2 Apr 2, 2024
86d961b
Merge branch 'microsoft:main' into main
saforem2 Apr 2, 2024
bed55a0
Update `ALCF/helpers.sh`
saforem2 Apr 4, 2024
fceb373
Update `megatron/data/data_samplers.py`
saforem2 Apr 4, 2024
07bb7bf
Add `train_llama_alcf.sh`
saforem2 Apr 4, 2024
3c1cdb4
Update `megatron/global_vars.py`
saforem2 Apr 4, 2024
5fe64ac
Turn on flops profiler in `generate_config.sh`
saforem2 Apr 4, 2024
316fd93
Update `megatron/model/language_model.py`
saforem2 Apr 4, 2024
69bb53e
Remove `--num-workers 0` in `train_llama_alcf_polaris.sh`
saforem2 Apr 4, 2024
2a36f14
Update `megatron/timers.py`
saforem2 Apr 4, 2024
c102ee2
Merge branch 'microsoft:main' into main
saforem2 Apr 4, 2024
58b1696
Update `ALCF/helpers.sh`
saforem2 Apr 4, 2024
9ac0159
Update `ALCF/helpers.sh`
saforem2 Apr 4, 2024
e54063b
Add `train_llama_nersc_perlmutter.sh`
saforem2 Apr 4, 2024
8ac8bdc
Update `{train_llama_alcf.sh,ALCF/helpers.sh}`
saforem2 Apr 4, 2024
8c6c91f
Update `megatron/training.py`
saforem2 Apr 4, 2024
7794fc0
Update `pretrain_gpt_alcf.py`
saforem2 Apr 4, 2024
590630e
Update `ALCF/helpers.sh`
saforem2 Apr 4, 2024
d03aac0
Update README.md
vksastry Apr 8, 2024
a66ee94
Merge branch 'microsoft:main' into main
saforem2 Apr 10, 2024
c72914f
Update `megatron/core/tensor_parallel/cross_entropy.py`
saforem2 Apr 16, 2024
7848cd4
Update `pretrain_gpt_alcf.py`
saforem2 Apr 16, 2024
f2b82b9
Removes old `train_sbatch_pp64.sh`
saforem2 Apr 16, 2024
6edd69d
Merge branch 'main' of https://github.com/argonne-lcf/Megatron-DeepSpeed
saforem2 Apr 16, 2024
5c3b5b7
Update `generate_config.sh`
saforem2 Apr 16, 2024
4e5c383
Add support for `schedulefree.{AdamWScheduleFree,SGDScheduleFree}`
saforem2 Apr 16, 2024
a68ed8e
update `train_llama_alcf.sh`
saforem2 Apr 16, 2024
a70aa6e
Update `pretrain_gpt_alcf.py`
saforem2 Apr 16, 2024
1de3c66
Fix checkpointing with `schedulefree.*` optimizers
saforem2 Apr 17, 2024
aa2cd59
Fix checkpointing with `schedulefree.*` optimizers
saforem2 Apr 17, 2024
a365a18
Add `--schedulefree-foreach` flag
saforem2 Apr 17, 2024
eecf70d
Replace `{:.6E}` with `{:.6f}` in `log_string` formatting
saforem2 Apr 17, 2024
6969dc2
Update `megatron/training.py`
saforem2 Apr 17, 2024
1d30d41
Update `ALCF/helpers.sh`
saforem2 Apr 18, 2024
981b7d9
Update logging in `megatron/training.py`
saforem2 Apr 18, 2024
7e6a3a4
Remove redundant `train_llama_alcf_*.sh`
saforem2 Apr 18, 2024
d243489
Update README.md
saforem2 Apr 18, 2024
78f3785
Update README.md
saforem2 Apr 18, 2024
2dc5aeb
Add `ALCF/data-lists/polaris/*.txt`
saforem2 Apr 18, 2024
27f66fd
Add support for DeepSpeed `FusedLamb` optimizer
saforem2 Apr 19, 2024
fc1b347
Add `ALCF/data-lists/sunspot/*.txt`
saforem2 Apr 19, 2024
80dc91c
Add support for `--optimizer ipex.{fused}lamb`
saforem2 Apr 19, 2024
36af9fc
Add support for `--optimizer ipex.{fused}lamb`
saforem2 Apr 19, 2024
997c39f
Update `ALCF/helpers.sh`
saforem2 Apr 19, 2024
e179fe0
Update `ALCF/helpers.sh`
saforem2 Apr 23, 2024
7f88a6e
Merge pull request #8 from argonne-lcf/hotfix-sirius
saforem2 Apr 23, 2024
133f244
Remove `apex` deps from `megatron/*`
saforem2 Apr 23, 2024
42a27fb
Move `generate_config.sh` logic into `ALCF/helpers.sh`
saforem2 Apr 23, 2024
3be7efc
Add option to launch with `mpiexec`
saforem2 Apr 23, 2024
a8a9a59
Update `train_llama_alcf.sh`
saforem2 Apr 24, 2024
42140d7
Update `ALCF/helpers.sh`
saforem2 Apr 24, 2024
fa0c5a6
Update `ALCF/helpers.sh`, `train_llama_alcf.sh`
saforem2 Apr 24, 2024
4b9c2f2
Add `ALCF/sunspot-env.sh`
saforem2 Apr 24, 2024
c2e9147
Update `train_llama_alcf.sh`, `ALCF/helpers.sh`
saforem2 Apr 24, 2024
41a3f35
Update `ALCF/helpers.sh`
saforem2 Apr 24, 2024
71c725e
Much faster check if `ezpz` installed
saforem2 Apr 24, 2024
ae0b4d8
Add option to run in `DEBUG` mode (i.e. `set -euxo pipefail`)
saforem2 Apr 24, 2024
2d6608a
Update `ALCF/data-lists/sunspot/*.txt`
saforem2 Apr 24, 2024
3648af5
Add `ALCF/test_sunspot.sh`
saforem2 Apr 24, 2024
9796eac
Add `ALCF/data-lists/sirius/books.txt`
saforem2 Apr 25, 2024
7b2ab6d
Add `ALCF/test_sirius.sh`
saforem2 Apr 25, 2024
58cdcca
Update `ALCF/test_sirius.sh`
saforem2 Apr 25, 2024
3145945
Merge pull request #9 from argonne-lcf/remove-apex-deps
saforem2 Apr 25, 2024
02a955c
Create `alcf-tests` branch
saforem2 Apr 25, 2024
23c9531
Update `ALCF/{test_sirius.sh,test_sunspot.sh}`
saforem2 Apr 25, 2024
5fff0af
Update `pretrain_gpt_alcf.py`
saforem2 Apr 25, 2024
005272b
Update `pretrain_gpt_alcf.py`
saforem2 Apr 25, 2024
fdb1707
Remove `ds_report` from `train_llama_alcf.sh`
saforem2 Apr 25, 2024
936c423
Update `.gitignore`
saforem2 Apr 25, 2024
a59a532
Update `ALCF/test_{sunspot,sirius}.sh`
saforem2 Apr 25, 2024
7681642
Merge pull request #10 from argonne-lcf/alcf-tests
saforem2 Apr 25, 2024
c9c87d9
Update `ALCF/data-lists/polaris/*.txt`
saforem2 Apr 26, 2024
caa1a4b
Add `ALCF/test_polaris.sh`
saforem2 Apr 26, 2024
b534e09
Fix duplicate loggers in `pretrain_gpt_alcf.py`
saforem2 Apr 26, 2024
2c4d772
Update `ALCF/helpers.sh`
saforem2 Apr 26, 2024
cfa6b52
Update `ALCF/test_{polaris,sirius,sunspot}.sh`
saforem2 Apr 26, 2024
a3114bf
Add `ALCF/data-lists/sunspot/dolma_v1_7_file_list.txt`
saforem2 Apr 26, 2024
f63aad1
Update `ALCF/helpers.sh`
saforem2 Apr 26, 2024
d329801
Update `ALCF/helpers.sh`
saforem2 Apr 26, 2024
585c15e
Update `train_llama_alcf.sh`
saforem2 Apr 26, 2024
3444b99
Update `.gitignore`
saforem2 Apr 26, 2024
505aef0
Update defaults in `ALCF/helpers.sh`
saforem2 Apr 26, 2024
e31bb23
Add `train_agpt.sh`
saforem2 Apr 26, 2024
a73e9af
Update `ALCF/test_{polaris,sirius,sunspot}.sh`
saforem2 Apr 26, 2024
57f1c96
Add `ALCF/test_alcf.sh`
saforem2 Apr 26, 2024
455126c
Update `ALCF/helpers.sh`
saforem2 Apr 26, 2024
2a49f6d
Update `train_agpt.sh`
saforem2 Apr 26, 2024
482dffd
Update `train_llama_alcf.sh`
saforem2 Apr 26, 2024
c04c42d
Update `ALCF/helpers.sh`
saforem2 Apr 27, 2024
36fa520
Fix for `conda/2024-04-29` on Polaris
saforem2 May 1, 2024
3b83b36
Add `train_agpt_polaris_7B_production.sh`
saforem2 May 1, 2024
a916a8d
Update `ALCF/helpers.sh` on Sunspot
saforem2 May 1, 2024
5257721
Update `ALCF/test_alcf.sh`
saforem2 May 8, 2024
ef30463
Merge pull request #11 from argonne-lcf/polaris-cuda122
saforem2 May 8, 2024
1ad039c
Add `train_agpt_polaris_7B_production_NCCL_OFI.sh`
saforem2 May 9, 2024
328dfda
Create `flash-attn-fix` branch
saforem2 May 15, 2024
bd8cb09
Add + pass default `LR_DECAY_ITERS`
saforem2 May 15, 2024
59f6052
Add `aGPT_7B.sh`
saforem2 May 15, 2024
9f98c09
Update `.gitignore`
saforem2 May 15, 2024
2cc2965
Rename `aGPT_7B.sh` -> `train_aGPT_7B.sh`
saforem2 May 15, 2024
a21870e
Merge branch 'microsoft:main' into main
saforem2 May 15, 2024
ed4ceae
Merge branch 'microsoft:main' into flash-attn-fix
saforem2 May 15, 2024
05e8af3
Add `ALCF/aws_ofi_nccl_plugin.sh` for Polaris
saforem2 May 16, 2024
6b4ea4c
Update `ALCF/{helpers.sh,train_llama_alcf.sh}`
saforem2 May 16, 2024
14970b9
Update `ALCF/helpers.sh` on Sunspot
saforem2 May 16, 2024
d1aec5d
Update `ALCF/sunspot-env.sh` with new modules for `anl_24_q2_release`
saforem2 May 16, 2024
7b8c819
Merge pull request #13 from argonne-lcf/flash-attn-fix
saforem2 May 16, 2024
13700cf
Update `ALCF/helpers.sh` with new release on Sunspot
saforem2 May 17, 2024
530c7c8
Update `train_llama_alcf.sh`
saforem2 May 17, 2024
b8cb2e8
Update `train_aGPT_7B.sh`
saforem2 May 17, 2024
0f18031
Add `setup_venv_from_conda` fn to `ALCF/helpers.sh`
saforem2 May 20, 2024
061e2cc
Update `train_aGPT_7B.sh`
saforem2 May 20, 2024
47bf9b5
Update `train_llama_alcf.sh`
saforem2 May 20, 2024
e68d270
Update `ALCF/README.md`
saforem2 May 20, 2024
4dd51dd
Merge pull request #14 from argonne-lcf/sunspot-frameworks-tests
saforem2 May 20, 2024
ac414a0
Update README.md
saforem2 May 20, 2024
9aa7fab
Fix path in `prof.export_chrome_trace()` from `pretrain_gpt_alcf.py`
saforem2 May 20, 2024
7d20359
Merge pull request #15 from argonne-lcf/fix-trace-output-path
saforem2 May 20, 2024
2f01543
Update README.md
saforem2 May 23, 2024
13171c2
Update README.md
saforem2 May 24, 2024
b371742
Add `setup_tokenizer_and_data()` function to `ALCF/helpers.sh`
saforem2 May 24, 2024
d93fb7f
Update `train_llama_alcf.sh`
saforem2 May 24, 2024
05d82c3
Update `train_aGPT_7B.sh`
saforem2 May 24, 2024
6de8496
Update `ALCF/README.md`
saforem2 May 24, 2024
03aa7c1
Update `ALCF/helpers.sh`
saforem2 May 24, 2024
3cd3f1a
Update `train_aGPT_7B.sh`
saforem2 May 24, 2024
bc1dbfd
Fix `--data-cache-path` in `ALCF/helpers.sh, train_llama_alcf.sh`
saforem2 May 24, 2024
c3a4451
Add `ALCF/sunspot-env-2024-04-15-002.sh`
saforem2 May 25, 2024
0fc3919
Update `train_aGPT_7B.sh`
saforem2 May 25, 2024
318d860
Merge branch 'tokenizer-tests' of https://github.com/argonne-lcf/Mega…
saforem2 May 25, 2024
c7a20cf
Merge pull request #17 from argonne-lcf/tokenizer-tests
saforem2 May 25, 2024
2b5b41f
convert MDS checkpoint to Hf Llama model
vksastry May 31, 2024
50 changes: 49 additions & 1 deletion .gitignore
@@ -1,3 +1,51 @@
# User Added
**.e1**
**.o1**
deps/*
OUTPUTS/*
ALCF/OUTPUTS/*
*tmp*
*core.*
*old*
*.bak
**index-cache**
**pbslogs**
ezpz
*.o17*
*.e17*
*hostfile*
.deepspeed_env
*.DS_Store
old/*
**venv**
*.json
*.o1
*.e1
outputs/
venvs/
wandb/
llama-logs/
checkpoints/
*.gz
*.txt
*.idx
*.bin
*.log
__pycache__

.deepspeed_env
*.bak
.cache/*
outputs/
venvs/
wandb/
llama-logs/
checkpoints/
*.gz
*.txt
*.idx
*.bin
*.log
__pycache__

# Distribution / packaging
@@ -20,4 +68,4 @@ slurm*
logs

# Data folder
bookcorpus_data/
bookcorpus_data/
798 changes: 798 additions & 0 deletions ALCF/README.md

Large diffs are not rendered by default.

20 changes: 20 additions & 0 deletions ALCF/aws_ofi_nccl_plugin.sh
@@ -0,0 +1,20 @@
#!/bin/bash --login

# AWS NCCL OFI Plugin settings below
export NCCL_CROSS_NIC=1
export NCCL_COLLNET_ENABLE=1
export NCCL_NET="AWS Libfabric"
export LD_LIBRARY_PATH=/soft/libraries/aws-ofi-nccl/v1.9.1-aws/lib:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=/soft/libraries/hwloc/lib/:$LD_LIBRARY_PATH
export FI_CXI_DISABLE_HOST_REGISTER=1
export FI_MR_CACHE_MONITOR=userfaultfd
export FI_CXI_DEFAULT_CQ_SIZE=131072
#########################################################
# WARNING: !!!
# - Currently, `export NCCL_NET_GDR_LEVEL=PHB`
# causes a hang on Polaris.
# so, we don't set it for the time being [2024-05-14].
# - Seems to work on Perlmutter ???
#
# export NCCL_NET_GDR_LEVEL=PHB
#########################################################
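
A minimal pre-launch check for the settings above, offered as a sketch and not as part of this PR. The variable names are copied from `ALCF/aws_ofi_nccl_plugin.sh`; the function `check_plugin_env` and the idea of running such a check are assumptions for illustration. It reports any exported variable that is missing and flags `NCCL_NET_GDR_LEVEL` if it is set, since the warning in the script says that setting currently hangs on Polaris.

```python
import os

# Names copied from ALCF/aws_ofi_nccl_plugin.sh; the check itself is hypothetical.
EXPECTED_VARS = (
    "NCCL_CROSS_NIC",
    "NCCL_COLLNET_ENABLE",
    "NCCL_NET",
    "FI_CXI_DISABLE_HOST_REGISTER",
    "FI_MR_CACHE_MONITOR",
    "FI_CXI_DEFAULT_CQ_SIZE",
)
# The script deliberately leaves this one unset on Polaris.
MUST_BE_UNSET = ("NCCL_NET_GDR_LEVEL",)


def check_plugin_env(env):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = [f"{name} is not set" for name in EXPECTED_VARS if name not in env]
    problems += [
        f"{name} is set to {env[name]!r} but should be unset"
        for name in MUST_BE_UNSET
        if name in env
    ]
    return problems


if __name__ == "__main__":
    # Check the live environment, e.g. after sourcing the plugin script.
    for problem in check_plugin_env(os.environ):
        print(f"[warn] {problem}")
```

Passing the environment in as a mapping, instead of reading `os.environ` inside the function, keeps the check easy to exercise without touching the real shell environment.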
16 changes: 16 additions & 0 deletions ALCF/data-lists/polaris/algebraic.txt
@@ -0,0 +1,16 @@
0.0018520780893211373 /eagle/datasets/dolma/data_v1.7_Llama2Tokenizer/algebraic-stack-train-0000_text_document
0.0017591050606817512 /eagle/datasets/dolma/data_v1.7_Llama2Tokenizer/algebraic-stack-train-0001_text_document
0.001459052794333798 /eagle/datasets/dolma/data_v1.7_Llama2Tokenizer/algebraic-stack-train-0002_text_document
0.0007405667281569194 /eagle/datasets/dolma/data_v1.7_Llama2Tokenizer/algebraic-stack-train-0003_text_document
0.00019420030110896795 /eagle/datasets/dolma/data_v1.7_Llama2Tokenizer/algebraic-stack-train-0004_text_document
0.0009008668715801845 /eagle/datasets/dolma/data_v1.7_Llama2Tokenizer/algebraic-stack-train-0005_text_document
0.00015115827957143057 /eagle/datasets/dolma/data_v1.7_Llama2Tokenizer/algebraic-stack-train-0006_text_document
0.0014552844319220648 /eagle/datasets/dolma/data_v1.7_Llama2Tokenizer/algebraic-stack-train-0007_text_document
0.0012469861325685161 /eagle/datasets/dolma/data_v1.7_Llama2Tokenizer/algebraic-stack-train-0008_text_document
0.00136412011372413 /eagle/datasets/dolma/data_v1.7_Llama2Tokenizer/algebraic-stack-train-0009_text_document
0.0007064279699221103 /eagle/datasets/dolma/data_v1.7_Llama2Tokenizer/algebraic-stack-train-0010_text_document
0.0008472240000687427 /eagle/datasets/dolma/data_v1.7_Llama2Tokenizer/algebraic-stack-train-0011_text_document
0.0001984375713341955 /eagle/datasets/dolma/data_v1.7_Llama2Tokenizer/algebraic-stack-train-0012_text_document
0.0005472773881697123 /eagle/datasets/dolma/data_v1.7_Llama2Tokenizer/algebraic-stack-train-0013_text_document
0.001815779629850992 /eagle/datasets/dolma/data_v1.7_Llama2Tokenizer/algebraic-stack-train-0014_text_document
0.0018313600689757324 /eagle/datasets/dolma/data_v1.7_Llama2Tokenizer/algebraic-stack-train-0015_text_document
100 changes: 100 additions & 0 deletions ALCF/data-lists/polaris/arxiv.txt
@@ -0,0 +1,100 @@
0.0002583902668716813 /eagle/datasets/dolma/data_v1.7_Llama2Tokenizer/arxiv-0000_text_document
0.0002646575141232155 /eagle/datasets/dolma/data_v1.7_Llama2Tokenizer/arxiv-0001_text_document
0.0003165521247456758 /eagle/datasets/dolma/data_v1.7_Llama2Tokenizer/arxiv-0002_text_document
0.0002920706460176214 /eagle/datasets/dolma/data_v1.7_Llama2Tokenizer/arxiv-0003_text_document
0.00028396813182810215 /eagle/datasets/dolma/data_v1.7_Llama2Tokenizer/arxiv-0004_text_document
0.00030445161883108107 /eagle/datasets/dolma/data_v1.7_Llama2Tokenizer/arxiv-0005_text_document
0.00031628781276576474 /eagle/datasets/dolma/data_v1.7_Llama2Tokenizer/arxiv-0006_text_document
0.0003083776568189157 /eagle/datasets/dolma/data_v1.7_Llama2Tokenizer/arxiv-0007_text_document
0.0003176359471472902 /eagle/datasets/dolma/data_v1.7_Llama2Tokenizer/arxiv-0008_text_document
0.0002536009369131698 /eagle/datasets/dolma/data_v1.7_Llama2Tokenizer/arxiv-0009_text_document
0.0003067491424681363 /eagle/datasets/dolma/data_v1.7_Llama2Tokenizer/arxiv-0010_text_document
0.0002597217257557784 /eagle/datasets/dolma/data_v1.7_Llama2Tokenizer/arxiv-0011_text_document
0.0003788556450109768 /eagle/datasets/dolma/data_v1.7_Llama2Tokenizer/arxiv-0012_text_document
0.0002796563272052598 /eagle/datasets/dolma/data_v1.7_Llama2Tokenizer/arxiv-0013_text_document
0.00033573826524290287 /eagle/datasets/dolma/data_v1.7_Llama2Tokenizer/arxiv-0014_text_document
0.00030523658022800287 /eagle/datasets/dolma/data_v1.7_Llama2Tokenizer/arxiv-0015_text_document
0.00032211552192240096 /eagle/datasets/dolma/data_v1.7_Llama2Tokenizer/arxiv-0016_text_document
0.0003329295675164247 /eagle/datasets/dolma/data_v1.7_Llama2Tokenizer/arxiv-0017_text_document
0.0003101982186639862 /eagle/datasets/dolma/data_v1.7_Llama2Tokenizer/arxiv-0018_text_document
0.00032361798234223355 /eagle/datasets/dolma/data_v1.7_Llama2Tokenizer/arxiv-0019_text_document
0.0003495541581652915 /eagle/datasets/dolma/data_v1.7_Llama2Tokenizer/arxiv-0020_text_document
0.0002821637448858042 /eagle/datasets/dolma/data_v1.7_Llama2Tokenizer/arxiv-0021_text_document
0.00030399523537629673 /eagle/datasets/dolma/data_v1.7_Llama2Tokenizer/arxiv-0022_text_document
0.0002955658968247219 /eagle/datasets/dolma/data_v1.7_Llama2Tokenizer/arxiv-0023_text_document
0.00028942158502924254 /eagle/datasets/dolma/data_v1.7_Llama2Tokenizer/arxiv-0024_text_document
0.00028769546171490733 /eagle/datasets/dolma/data_v1.7_Llama2Tokenizer/arxiv-0025_text_document
0.0002938111057234182 /eagle/datasets/dolma/data_v1.7_Llama2Tokenizer/arxiv-0026_text_document
0.0002711150403010948 /eagle/datasets/dolma/data_v1.7_Llama2Tokenizer/arxiv-0027_text_document
0.00031130095874747565 /eagle/datasets/dolma/data_v1.7_Llama2Tokenizer/arxiv-0028_text_document
0.0003002996118160777 /eagle/datasets/dolma/data_v1.7_Llama2Tokenizer/arxiv-0029_text_document
0.0003732757901604459 /eagle/datasets/dolma/data_v1.7_Llama2Tokenizer/arxiv-0030_text_document
0.00026784205751795894 /eagle/datasets/dolma/data_v1.7_Llama2Tokenizer/arxiv-0031_text_document
0.0002799626521661984 /eagle/datasets/dolma/data_v1.7_Llama2Tokenizer/arxiv-0032_text_document
0.00034334276069078164 /eagle/datasets/dolma/data_v1.7_Llama2Tokenizer/arxiv-0033_text_document
0.0003582469803674965 /eagle/datasets/dolma/data_v1.7_Llama2Tokenizer/arxiv-0034_text_document
0.00031094844818418623 /eagle/datasets/dolma/data_v1.7_Llama2Tokenizer/arxiv-0035_text_document
0.0002766228384977191 /eagle/datasets/dolma/data_v1.7_Llama2Tokenizer/arxiv-0036_text_document
0.00030297116159471485 /eagle/datasets/dolma/data_v1.7_Llama2Tokenizer/arxiv-0037_text_document
0.00027033888377464685 /eagle/datasets/dolma/data_v1.7_Llama2Tokenizer/arxiv-0038_text_document
0.00030090862368377933 /eagle/datasets/dolma/data_v1.7_Llama2Tokenizer/arxiv-0039_text_document
0.00028543875802490955 /eagle/datasets/dolma/data_v1.7_Llama2Tokenizer/arxiv-0040_text_document
0.00027559768459074204 /eagle/datasets/dolma/data_v1.7_Llama2Tokenizer/arxiv-0041_text_document
0.0003182185533962886 /eagle/datasets/dolma/data_v1.7_Llama2Tokenizer/arxiv-0042_text_document
0.0003311392971435837 /eagle/datasets/dolma/data_v1.7_Llama2Tokenizer/arxiv-0043_text_document
0.00028751652060804325 /eagle/datasets/dolma/data_v1.7_Llama2Tokenizer/arxiv-0044_text_document
0.000303466863212589 /eagle/datasets/dolma/data_v1.7_Llama2Tokenizer/arxiv-0045_text_document
0.00033400462801277524 /eagle/datasets/dolma/data_v1.7_Llama2Tokenizer/arxiv-0046_text_document
0.0002589234031777426 /eagle/datasets/dolma/data_v1.7_Llama2Tokenizer/arxiv-0047_text_document
0.0002913508598466723 /eagle/datasets/dolma/data_v1.7_Llama2Tokenizer/arxiv-0048_text_document
0.0002670572450004856 /eagle/datasets/dolma/data_v1.7_Llama2Tokenizer/arxiv-0049_text_document
0.00032027399105647656 /eagle/datasets/dolma/data_v1.7_Llama2Tokenizer/arxiv-0050_text_document
0.00032188376258379377 /eagle/datasets/dolma/data_v1.7_Llama2Tokenizer/arxiv-0051_text_document
0.0003161585784100882 /eagle/datasets/dolma/data_v1.7_Llama2Tokenizer/arxiv-0052_text_document
0.0003184249182974135 /eagle/datasets/dolma/data_v1.7_Llama2Tokenizer/arxiv-0053_text_document
0.00030381336664000807 /eagle/datasets/dolma/data_v1.7_Llama2Tokenizer/arxiv-0054_text_document
0.0003190437442184283 /eagle/datasets/dolma/data_v1.7_Llama2Tokenizer/arxiv-0055_text_document
0.0002537961798200545 /eagle/datasets/dolma/data_v1.7_Llama2Tokenizer/arxiv-0056_text_document
0.0003017817117223326 /eagle/datasets/dolma/data_v1.7_Llama2Tokenizer/arxiv-0057_text_document
0.00028685268513240224 /eagle/datasets/dolma/data_v1.7_Llama2Tokenizer/arxiv-0058_text_document
0.00031265179094451165 /eagle/datasets/dolma/data_v1.7_Llama2Tokenizer/arxiv-0059_text_document
0.00034708319096986816 /eagle/datasets/dolma/data_v1.7_Llama2Tokenizer/arxiv-0060_text_document
0.00026650837943080664 /eagle/datasets/dolma/data_v1.7_Llama2Tokenizer/arxiv-0061_text_document
0.00034588832248507335 /eagle/datasets/dolma/data_v1.7_Llama2Tokenizer/arxiv-0062_text_document
0.0002416982248399037 /eagle/datasets/dolma/data_v1.7_Llama2Tokenizer/arxiv-0063_text_document
0.0003089296918222243 /eagle/datasets/dolma/data_v1.7_Llama2Tokenizer/arxiv-0064_text_document
0.00029137184185700827 /eagle/datasets/dolma/data_v1.7_Llama2Tokenizer/arxiv-0065_text_document
0.00026464226846800774 /eagle/datasets/dolma/data_v1.7_Llama2Tokenizer/arxiv-0066_text_document
0.00030545397919456627 /eagle/datasets/dolma/data_v1.7_Llama2Tokenizer/arxiv-0067_text_document
0.0003206778460448875 /eagle/datasets/dolma/data_v1.7_Llama2Tokenizer/arxiv-0068_text_document
0.00030968971641110967 /eagle/datasets/dolma/data_v1.7_Llama2Tokenizer/arxiv-0069_text_document
0.00023325653928600864 /eagle/datasets/dolma/data_v1.7_Llama2Tokenizer/arxiv-0070_text_document
0.00030526899198338555 /eagle/datasets/dolma/data_v1.7_Llama2Tokenizer/arxiv-0071_text_document
0.00035376719076633584 /eagle/datasets/dolma/data_v1.7_Llama2Tokenizer/arxiv-0072_text_document
0.000290224385981026 /eagle/datasets/dolma/data_v1.7_Llama2Tokenizer/arxiv-0073_text_document
0.000294650083382008 /eagle/datasets/dolma/data_v1.7_Llama2Tokenizer/arxiv-0074_text_document
0.00028768858128616436 /eagle/datasets/dolma/data_v1.7_Llama2Tokenizer/arxiv-0075_text_document
0.00030856965235527843 /eagle/datasets/dolma/data_v1.7_Llama2Tokenizer/arxiv-0076_text_document
0.00030579942447879054 /eagle/datasets/dolma/data_v1.7_Llama2Tokenizer/arxiv-0077_text_document
0.0002863101084704357 /eagle/datasets/dolma/data_v1.7_Llama2Tokenizer/arxiv-0078_text_document
0.0002870032092492213 /eagle/datasets/dolma/data_v1.7_Llama2Tokenizer/arxiv-0079_text_document
0.000264182727569885 /eagle/datasets/dolma/data_v1.7_Llama2Tokenizer/arxiv-0080_text_document
0.0002974012367036449 /eagle/datasets/dolma/data_v1.7_Llama2Tokenizer/arxiv-0081_text_document
0.00032238412143059203 /eagle/datasets/dolma/data_v1.7_Llama2Tokenizer/arxiv-0082_text_document
0.00031683716893819036 /eagle/datasets/dolma/data_v1.7_Llama2Tokenizer/arxiv-0083_text_document
0.00031157434937617524 /eagle/datasets/dolma/data_v1.7_Llama2Tokenizer/arxiv-0084_text_document
0.0003411742735695989 /eagle/datasets/dolma/data_v1.7_Llama2Tokenizer/arxiv-0085_text_document
0.00026778444816570715 /eagle/datasets/dolma/data_v1.7_Llama2Tokenizer/arxiv-0086_text_document
0.0003037045797275201 /eagle/datasets/dolma/data_v1.7_Llama2Tokenizer/arxiv-0087_text_document
0.00027746114370081314 /eagle/datasets/dolma/data_v1.7_Llama2Tokenizer/arxiv-0088_text_document
0.00027148285946862043 /eagle/datasets/dolma/data_v1.7_Llama2Tokenizer/arxiv-0089_text_document
0.00028042950114678207 /eagle/datasets/dolma/data_v1.7_Llama2Tokenizer/arxiv-0090_text_document
0.0003235607816590721 /eagle/datasets/dolma/data_v1.7_Llama2Tokenizer/arxiv-0091_text_document
0.0003086692227306295 /eagle/datasets/dolma/data_v1.7_Llama2Tokenizer/arxiv-0092_text_document
0.00033990349455148105 /eagle/datasets/dolma/data_v1.7_Llama2Tokenizer/arxiv-0093_text_document
0.00030945053208470265 /eagle/datasets/dolma/data_v1.7_Llama2Tokenizer/arxiv-0094_text_document
0.00027309074552265303 /eagle/datasets/dolma/data_v1.7_Llama2Tokenizer/arxiv-0095_text_document
0.00028737393506316194 /eagle/datasets/dolma/data_v1.7_Llama2Tokenizer/arxiv-0096_text_document
0.0003098868328009879 /eagle/datasets/dolma/data_v1.7_Llama2Tokenizer/arxiv-0097_text_document
0.0002614229162588409 /eagle/datasets/dolma/data_v1.7_Llama2Tokenizer/arxiv-0098_text_document
0.0002884388407820923 /eagle/datasets/dolma/data_v1.7_Llama2Tokenizer/arxiv-0099_text_document
3 changes: 3 additions & 0 deletions ALCF/data-lists/polaris/books.txt
@@ -0,0 +1,3 @@
0.006 /eagle/datasets/dolma/data_v1.7_Llama2Tokenizer/books-0000_text_document
0.006 /eagle/datasets/dolma/data_v1.7_Llama2Tokenizer/books-0001_text_document
0.006 /eagle/datasets/dolma/data_v1.7_Llama2Tokenizer/books-0002_text_document
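
Each line in the data-list files above pairs a numeric weight with a dataset path prefix. The sketch below shows one way such a file could be read; it is illustrative and is not the loader used in this repository. The helper names `parse_data_list` and `normalize` are hypothetical, and the code assumes that the weights are relative and get rescaled to sum to 1 downstream, which is a common convention for blended datasets but is not confirmed by this diff.

```python
from typing import List, Tuple


def parse_data_list(text: str) -> List[Tuple[float, str]]:
    """Parse lines of the form '<weight> <path-prefix>' into (weight, prefix) pairs."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # tolerating blank lines and comments is an assumption
        weight, prefix = line.split(maxsplit=1)
        entries.append((float(weight), prefix))
    return entries


def normalize(entries: List[Tuple[float, str]]) -> List[Tuple[float, str]]:
    """Rescale weights so they sum to 1, keeping their relative proportions."""
    total = sum(weight for weight, _ in entries)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return [(weight / total, prefix) for weight, prefix in entries]


# The three lines of ALCF/data-lists/polaris/books.txt from the diff.
sample = """\
0.006 /eagle/datasets/dolma/data_v1.7_Llama2Tokenizer/books-0000_text_document
0.006 /eagle/datasets/dolma/data_v1.7_Llama2Tokenizer/books-0001_text_document
0.006 /eagle/datasets/dolma/data_v1.7_Llama2Tokenizer/books-0002_text_document
"""

for weight, prefix in normalize(parse_data_list(sample)):
    print(f"{weight:.4f}  {prefix}")
```

With three equal weights, each shard ends up with roughly a third of the total. When several lists are concatenated, the raw weights determine how much each corpus contributes relative to the others.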