Fix BytePair special tokens tokenization #1447

Closed
wants to merge 70 commits into from
70 commits
91ff01d
Fix BP special tokens tokenization
abuelnasr0 Feb 20, 2024
6c26d84
Add test to Bart
abuelnasr0 Feb 20, 2024
b4769ea
Add test to Bloom
abuelnasr0 Feb 20, 2024
814836b
Remove ToDo comment
abuelnasr0 Feb 20, 2024
f451351
Add tests for Roberta
abuelnasr0 Feb 20, 2024
464555c
Remove roberta todo comment
abuelnasr0 Feb 20, 2024
ab7b48a
Fix split comment
abuelnasr0 Feb 21, 2024
996fc48
Update our sampler documentation to reflect usage (#1444)
mattdangerw Feb 20, 2024
9da7400
Add Gemma model (#1448)
mattdangerw Feb 21, 2024
f75d8cb
Update to the newest version of Gemma on Kaggle (#1454)
mattdangerw Feb 22, 2024
cd5e33c
Add dtype arg to Gemma HF conversion script (#1452)
nkovela1 Feb 22, 2024
e2624a1
Fix gemma testing import (#1462)
mattdangerw Feb 23, 2024
4a0adf2
Add docstring for PyTorch conversion script install instructions (#1471)
nkovela1 Feb 27, 2024
6c642c8
Add an annotation to tests that need kaggle auth (#1470)
mattdangerw Feb 27, 2024
4ba3ca7
Fix Mistral memory consumption with JAX and default dtype bug (#1460)
tirthasheshpatel Feb 27, 2024
5d22424
Bump the master version to 0.9 (#1473)
mattdangerw Feb 27, 2024
3db86d1
Pin to TF 2.16 RC0 (#1478)
sampathweb Feb 28, 2024
414b4f4
Fix gemma rms_normalization's use of epsilon (#1472)
cpsauer Feb 28, 2024
8590c22
Add `FalconBackbone` (#1475)
SamanehSaadat Mar 1, 2024
c739f81
CI - Add kaggle creds to pull model (#1459)
sampathweb Mar 4, 2024
134f8b7
Update reversible_embedding.py (#1484)
TheCrazyT Mar 4, 2024
c1b6b54
doc fix for constrastive sampler (#1488)
mattdangerw Mar 5, 2024
f3eda3c
Remove broken link to masking and padding guide (#1487)
mattdangerw Mar 5, 2024
7f692ca
Fix a typo. (#1489)
SamanehSaadat Mar 5, 2024
8851624
Fix dtype accessors of tasks/backbones (#1486)
mattdangerw Mar 6, 2024
f92d4f8
Auto-labels 'gemma' on 'gemma' issues/PRs. (#1490)
shmishra99 Mar 6, 2024
3cacebd
Add BloomCausalLM (#1467)
abuelnasr0 Mar 6, 2024
536e1ba
Remove the bert jupyter conversion notebooks (#1492)
mattdangerw Mar 7, 2024
7e1362f
Add `FalconTokenizer` (#1485)
SamanehSaadat Mar 8, 2024
1848224
Add Falcon Preprocessor. (#1498)
SamanehSaadat Mar 8, 2024
49c243b
Rename 176B presets (#1496)
abuelnasr0 Mar 8, 2024
865034d
Add bloom presets (#1501)
abuelnasr0 Mar 11, 2024
786aa94
Create workflow for auto assignment of issues and for stale issues (…
sachinprasadhs Mar 11, 2024
8698f84
Update requirements to TF 2.16 GA (#1503)
sampathweb Mar 11, 2024
29a87cb
Expose Task and Backbone (#1506)
mattdangerw Mar 11, 2024
7e3dfc8
Clean up and add our gemma conversion script (#1493)
mattdangerw Mar 11, 2024
8c94113
Don't auto-update JAX GPU (#1507)
sampathweb Mar 12, 2024
81dd7b5
Keep rope at float32 precision (#1497)
grasskin Mar 13, 2024
0b0305a
Bump the python group with 2 updates (#1509)
dependabot[bot] Mar 13, 2024
f29aff8
Fixes for the LLaMA backbone + add dropout (#1499)
tirthasheshpatel Mar 13, 2024
34d2099
Add `LlamaPreprocessor` and `LlamaCausalLMPreprocessor` (#1511)
tirthasheshpatel Mar 13, 2024
0ef44ff
Always run the rotary embedding layer in float32 (#1508)
tirthasheshpatel Mar 14, 2024
1cc8df5
Remove install of Python 3.9 (#1514)
sampathweb Mar 14, 2024
db855bc
Update gemma_backbone.py for sharding config. (#1491)
qlzh727 Mar 14, 2024
d1031df
Unify docstring style
sachinprasadhs Mar 20, 2024
2acb4c9
Revert "Unify docstring style"
mattdangerw Mar 20, 2024
898329f
Docs/modelling layers (#1502)
mykolaskrynnyk Mar 20, 2024
5944635
Standardize docstring (#1516)
sachinprasadhs Mar 20, 2024
3ddfd88
Support tokenization of special tokens for word_piece_tokenizer (#1397)
abuelnasr0 Mar 20, 2024
c3b2c09
Upload Model to Kaggle (#1512)
SamanehSaadat Mar 25, 2024
eb4ef20
Add scoring mode to MistralCausalLM (#1521)
RyanMullins Mar 25, 2024
f1714e1
Add Mistral Instruct V0.2 preset (#1520)
tirthasheshpatel Mar 25, 2024
6703d76
Add Tests for Kaggle Upload Validation (#1524)
SamanehSaadat Mar 26, 2024
6a8166e
Add presets for Electra and checkpoint conversion script (#1384)
pranavvp16 Mar 26, 2024
6ea1e63
Allow saving / loading from Huggingface Hub preset (#1510)
Wauplin Mar 27, 2024
6e946e2
Stop on multiple end tokens (#1518)
grasskin Mar 27, 2024
e5b2833
Update mistral_tokenizer.py (#1528)
asmith26 Mar 27, 2024
2be333c
Add lora example to GemmaCausalLM docstring (#1527)
SamanehSaadat Mar 27, 2024
859b1bf
Add LLaMA Causal LM with 7B presets (#1526)
tirthasheshpatel Mar 28, 2024
f8aba3c
Add task base classes (#1517)
mattdangerw Mar 28, 2024
db831d7
Doc fixes (#1530)
mattdangerw Mar 28, 2024
8c7aa4d
Run the LLaMA and Mistral RMS Layer Norm in float32 (#1532)
tirthasheshpatel Mar 29, 2024
1192db4
Adds score API to GPT-2 (#1533)
RyanMullins Mar 29, 2024
035a776
increase pip timeout to 1000s to avoid connection resets (#1535)
sampathweb Mar 29, 2024
298e15c
Adds the score API to LlamaCausalLM (#1534)
RyanMullins Mar 29, 2024
91aa654
Implement compute_output_spec() for tokenizers with vocabulary. (#1523)
briango28 Mar 29, 2024
d95c271
Remove staggler type annotiations (#1536)
mattdangerw Mar 29, 2024
dcebc7c
Always run SiLU activation in float32 for LLaMA and Mistral (#1540)
tirthasheshpatel Apr 1, 2024
29873a9
Bump the python group with 2 updates (#1538)
dependabot[bot] Apr 1, 2024
d0ff826
Add special_tokens_in_strings to byte_pair_tokenizer
abuelnasr0 Apr 2, 2024
3 changes: 3 additions & 0 deletions .github/dependabot.yml
@@ -21,3 +21,6 @@ updates:
python:
patterns:
- "*"
ignore:
# ignore all updates for JAX GPU due to cuda version issue
- dependency-name: "jax[cuda12_pip]"
21 changes: 21 additions & 0 deletions .github/workflows/auto-assignment.yml
@@ -0,0 +1,21 @@
name: auto-assignment
on:
issues:
types:
- opened

permissions:
contents: read
issues: write
pull-requests: write

jobs:
welcome:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- uses: actions/github-script@v7
with:
script: |
const script = require('./\.github/workflows/scripts/auto-assignment.js')
script({github, context})
42 changes: 42 additions & 0 deletions .github/workflows/labeler.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,42 @@
# Copyright 2024 Google LLC. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.


# This workflow automatically identifies issues and pull requests (PRs)
# related to Gemma. It searches for the keyword "Gemma" (case-insensitive)
# in both the title and description of the issue/PR. If a match is found,
# the workflow adds the label 'Gemma' to the issue/PR.

name: 'Labeler'
on:
issues:
types: [edited, opened]
pull_request_target:
types: [opened, edited]

permissions:
contents: read
issues: write
pull-requests: write

jobs:
welcome:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- uses: actions/github-script@v7
with:
script: |
const script = require('./\.github/workflows/scripts/labeler.js')
script({github, context})
43 changes: 43 additions & 0 deletions .github/workflows/scripts/auto-assignment.js
@@ -0,0 +1,43 @@
/** Automatically assign issues and PRs to users in the `assigneesList`
* on a rotating basis.
*
* @param {!Object} params
* @param {!Object} params.github Pre-authenticated client used to call GitHub APIs.
* @param {!Object} params.context Workflow context containing the issue or PR details.
*/

module.exports = async ({ github, context }) => {
let issueNumber;
let assigneesList;
// Is this an issue? If so, assign the issue number. Otherwise, assign the PR number.
if (context.payload.issue) {
//assignee List for issues.
assigneesList = ["SuryanarayanaY", "sachinprasadhs"];
issueNumber = context.payload.issue.number;
} else {
//assignee List for PRs.
assigneesList = ["mattdangerw"];
issueNumber = context.payload.number;
}
console.log("assignee list", assigneesList);
console.log("entered auto assignment for this issue: ", issueNumber);
if (!assigneesList.length) {
console.log("No assignees found for this repo.");
return;
}
let noOfAssignees = assigneesList.length;
let selection = issueNumber % noOfAssignees;
let assigneeForIssue = assigneesList[selection];

console.log(
"issue Number = ",
issueNumber + " , assigning to: ",
assigneeForIssue
);
return github.rest.issues.addAssignees({
issue_number: context.issue.number,
owner: context.repo.owner,
repo: context.repo.repo,
assignees: [assigneeForIssue],
});
};
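The rotation in the script above comes down to taking the issue or PR number modulo the length of the assignee list. A minimal sketch of that selection as a standalone function follows; `pickAssignee` is a hypothetical helper introduced here for illustration and is not part of the PR.

```javascript
// Hypothetical helper (not in the PR): choose an assignee by taking the
// issue or PR number modulo the number of candidates.
function pickAssignee(number, assignees) {
  if (!Array.isArray(assignees) || assignees.length === 0) {
    return null; // nothing to assign
  }
  return assignees[number % assignees.length];
}

const issueAssignees = ["SuryanarayanaY", "sachinprasadhs"];
console.log(pickAssignee(1447, issueAssignees)); // 1447 % 2 === 1 -> "sachinprasadhs"
console.log(pickAssignee(1448, issueAssignees)); // 1448 % 2 === 0 -> "SuryanarayanaY"
```

Because consecutive issue numbers alternate between odd and even, a two-person list splits new issues roughly evenly without storing any state between workflow runs.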
53 changes: 53 additions & 0 deletions .github/workflows/scripts/labeler.js
@@ -0,0 +1,53 @@
/*
Copyright 2024 Google LLC. All Rights Reserved.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
*/


/**
* Invoked from labeler.yaml to add the label 'Gemma' to issues and PRs
* whose title or description contains the keyword 'gemma'.
* @param {!Object.<string,!Object>} github Client with predefined functions for calling the GitHub API.
* context Information about the workflow run.
*/

module.exports = async ({ github, context }) => {
const issue_title = context.payload.issue ? context.payload.issue.title : context.payload.pull_request.title
let issue_description = context.payload.issue ? context.payload.issue.body : context.payload.pull_request.body
const issue_number = context.payload.issue ? context.payload.issue.number : context.payload.pull_request.number
const keyword_label = {
gemma:'Gemma'
}
const labelsToAdd = []
console.log(issue_title,issue_description,issue_number)
if (issue_description==null)
{
issue_description = ''
}

for(const [keyword, label] of Object.entries(keyword_label)){
if(issue_title.toLowerCase().indexOf(keyword) !=-1 || issue_description.toLowerCase().indexOf(keyword) !=-1 ){
console.log(`'${keyword}' keyword is present inside the title or description. Pushing label '${label}' to row.`)
labelsToAdd.push(label)
}
}
if(labelsToAdd.length > 0){
console.log(`Adding labels ${labelsToAdd} to the issue '#${issue_number}'.`)
github.rest.issues.addLabels({
owner: context.repo.owner,
repo: context.repo.repo,
issue_number: context.issue.number,
labels: labelsToAdd
})
}
};
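The matching step in the script above can be isolated as a pure function, which makes it easy to check without a workflow run. This is a sketch under assumptions: `labelsFor` is a hypothetical name, and joining the title and body into one string is a simplification of the two separate `indexOf` checks.

```javascript
// Hypothetical helper (not in the PR): return the labels whose keyword
// appears, case-insensitively, in the title or the description.
function labelsFor(title, body, keywordLabel = { gemma: "Gemma" }) {
  // A missing title or body (null/undefined) is treated as an empty string.
  const text = `${title ?? ""}\n${body ?? ""}`.toLowerCase();
  return Object.entries(keywordLabel)
    .filter(([keyword]) => text.includes(keyword))
    .map(([, label]) => label);
}

console.log(labelsFor("Add Gemma model", null));
console.log(labelsFor("Fix BytePair special tokens tokenization", ""));
```

The first call matches on the title alone; the second returns an empty list, so no label would be added.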
50 changes: 50 additions & 0 deletions .github/workflows/stale-issue-pr.yml
@@ -0,0 +1,50 @@
name: Close inactive issues
on:
schedule:
- cron: "30 1 * * *"
jobs:
close-issues:
runs-on: ubuntu-latest
permissions:
issues: write
pull-requests: write
steps:
- name: Awaiting response issues
uses: actions/stale@v9
with:
days-before-issue-stale: 14
days-before-issue-close: 14
stale-issue-label: "stale"
# Reason for closing the issue; the default value is not_planned.
close-issue-reason: completed
only-labels: "stat:awaiting response from contributor"
stale-issue-message: >
This issue is stale because it has been open for 14 days with no activity.
It will be closed if no further activity occurs. Thank you.
# List of labels to remove when issues/PRs unstale.
labels-to-remove-when-unstale: "stat:awaiting response from contributor"
close-issue-message: >
This issue was closed because it has been inactive for 28 days.
Please reopen if you'd like to work on this further.
days-before-pr-stale: 14
days-before-pr-close: 14
stale-pr-message: "This PR is stale because it has been open for 14 days with no activity. It will be closed if no further activity occurs. Thank you."
close-pr-message: "This PR was closed because it has been inactive for 28 days. Please reopen if you'd like to work on this further."
repo-token: ${{ secrets.GITHUB_TOKEN }}
- name: Contribution issues
uses: actions/stale@v9
with:
days-before-issue-stale: 180
days-before-issue-close: 365
stale-issue-label: "stale"
# Reason for closing the issue; the default value is not_planned.
close-issue-reason: not_planned
any-of-labels: "stat:contributions welcome,good first issue"
# List of labels to remove when issues/PRs unstale.
labels-to-remove-when-unstale: "stat:contributions welcome,good first issue"
stale-issue-message: >
This issue is stale because it has been open for 180 days with no activity.
It will be closed if no further activity occurs. Thank you.
close-issue-message: >
This issue was closed because it has been inactive for more than 1 year.
repo-token: ${{ secrets.GITHUB_TOKEN }}
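The "28 days" in the close messages of the first step is the sum of the two 14-day windows. A rough sketch of that timeline follows, assuming the close countdown starts once the stale label is applied and ignoring the once-a-day cron granularity; `staleState` and `DAY_MS` are hypothetical names, not part of `actions/stale`.

```javascript
// Hypothetical model (not part of actions/stale): classify an item by how
// long it has been idle, given the two thresholds from the workflow.
const DAY_MS = 24 * 60 * 60 * 1000;

function staleState(lastActivityMs, nowMs, daysBeforeStale = 14, daysBeforeClose = 14) {
  const idleDays = Math.floor((nowMs - lastActivityMs) / DAY_MS);
  if (idleDays >= daysBeforeStale + daysBeforeClose) return "closed";
  if (idleDays >= daysBeforeStale) return "stale";
  return "active";
}

console.log(staleState(0, 13 * DAY_MS)); // "active"
console.log(staleState(0, 14 * DAY_MS)); // "stale"
console.log(staleState(0, 28 * DAY_MS)); // "closed"
```

Under the same assumption, the second step (180 and 365 days) would close an issue only after 545 idle days, which is consistent with its "more than 1 year" message.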
29 changes: 21 additions & 8 deletions .kokoro/github/ubuntu/gpu/build.sh
@@ -1,9 +1,20 @@
 set -e
-set -x

-cd "${KOKORO_ROOT}/"
+export KAGGLE_KEY="$(cat ${KOKORO_KEYSTORE_DIR}/73361_keras_kaggle_secret_key)"
+export KAGGLE_USERNAME="$(cat ${KOKORO_KEYSTORE_DIR}/73361_keras_kaggle_username)"

+if [[ -z "${KAGGLE_KEY}" ]]; then
+echo "KAGGLE_KEY is NOT set"
+exit 1
+fi
+
-sudo update-alternatives --install /usr/bin/python3 python3 /usr/bin/python3.9 1
+if [[ -z "${KAGGLE_USERNAME}" ]]; then
+echo "KAGGLE_USERNAME is NOT set"
+exit 1
+fi

+set -x
+cd "${KOKORO_ROOT}/"
+
 PYTHON_BINARY="/usr/bin/python3.9"

@@ -24,23 +35,25 @@ pip install -U pip setuptools psutil
 if [ "${KERAS2:-0}" == "1" ]
 then
 echo "Keras2 detected."
-pip install -r requirements-common.txt --progress-bar off
-pip install tensorflow-text==2.15 tensorflow[and-cuda]~=2.15 keras-core
+pip install -r requirements-common.txt --progress-bar off --timeout 1000
+pip install tensorflow-text==2.15 tensorflow[and-cuda]~=2.15 keras-core \
+--timeout 1000

 elif [ "$KERAS_BACKEND" == "tensorflow" ]
 then
 echo "TensorFlow backend detected."
-pip install -r requirements-tensorflow-cuda.txt --progress-bar off
+pip install -r requirements-tensorflow-cuda.txt --progress-bar off \
+--timeout 1000

 elif [ "$KERAS_BACKEND" == "jax" ]
 then
 echo "JAX backend detected."
-pip install -r requirements-jax-cuda.txt --progress-bar off
+pip install -r requirements-jax-cuda.txt --progress-bar off --timeout 1000

 elif [ "$KERAS_BACKEND" == "torch" ]
 then
 echo "PyTorch backend detected."
-pip install -r requirements-torch-cuda.txt --progress-bar off
+pip install -r requirements-torch-cuda.txt --progress-bar off --timeout 1000
 fi

 pip install --no-deps -e "." --progress-bar off
18 changes: 18 additions & 0 deletions .kokoro/github/ubuntu/gpu/jax/continuous.cfg
@@ -12,5 +12,23 @@ env_vars: {
value: "jax"
}

before_action {
fetch_keystore {
keystore_resource {
keystore_config_id: 73361
keyname: "keras_kaggle_username"
}
}
}

before_action {
fetch_keystore {
keystore_resource {
keystore_config_id: 73361
keyname: "keras_kaggle_secret_key"
}
}
}

# Set timeout to 60 mins from default 180 mins
timeout_mins: 60
18 changes: 18 additions & 0 deletions .kokoro/github/ubuntu/gpu/jax/presubmit.cfg
@@ -12,5 +12,23 @@ env_vars: {
value: "jax"
}

before_action {
fetch_keystore {
keystore_resource {
keystore_config_id: 73361
keyname: "keras_kaggle_username"
}
}
}

before_action {
fetch_keystore {
keystore_resource {
keystore_config_id: 73361
keyname: "keras_kaggle_secret_key"
}
}
}

# Set timeout to 60 mins from default 180 mins
timeout_mins: 60
18 changes: 18 additions & 0 deletions .kokoro/github/ubuntu/gpu/keras2/continuous.cfg
@@ -12,5 +12,23 @@ env_vars: {
value: "1"
}

before_action {
fetch_keystore {
keystore_resource {
keystore_config_id: 73361
keyname: "keras_kaggle_username"
}
}
}

before_action {
fetch_keystore {
keystore_resource {
keystore_config_id: 73361
keyname: "keras_kaggle_secret_key"
}
}
}

# Set timeout to 60 mins from default 180 mins
timeout_mins: 60
18 changes: 18 additions & 0 deletions .kokoro/github/ubuntu/gpu/keras2/presubmit.cfg
@@ -12,5 +12,23 @@ env_vars: {
value: "1"
}

before_action {
fetch_keystore {
keystore_resource {
keystore_config_id: 73361
keyname: "keras_kaggle_username"
}
}
}

before_action {
fetch_keystore {
keystore_resource {
keystore_config_id: 73361
keyname: "keras_kaggle_secret_key"
}
}
}

# Set timeout to 60 mins from default 180 mins
timeout_mins: 60