
ML-Reproducibility-2020 Towards Interpreting BERT for RCQA

This is our repository for the implementation of the paper Towards Interpreting BERT for Reading Comprehension Based QA as a part of the ML Reproducibility Challenge 2020.

Our report was, unfortunately, rejected. You can find it here: https://openreview.net/forum?id=LI1n_od-aEq

Paper Summary

The paper uses Integrated Gradients (IG) to identify layer roles for BERT on Reading Comprehension QA tasks - SQuAD v1.1 and DuoRC SelfRC. IG is used to create a probability distribution over the sequence for each example. Jensen-Shannon Divergence heatmaps are then computed across layers for 1000 samples, once with only the top-2 tokens retained and once with the top-2 tokens removed, to check whether layers focus on different words. Next, the token-wise importances are combined into word-wise importances, and the top-5 words are used to measure the percentage of predicted answer words, contextual words (within a window of size 5 around the answer) and query words present in the passage at each layer for all dev samples. The authors observe that layers focus more on answer and contextual words and less on query words as the layers progress. This means that later layers focus on the answer and the words around the answer span, while initial layers focus on the query words and possible answers. They then plot an example based on the word importances for several layers, along with a t-SNE plot of each layer's representation. Finally, they examine quantifier questions ('how much', 'how many') and observe that the ratio of numerical words in the top-5 words increases as the layers progress. This is surprising, because BERT's confidence on such questions is still very high and the Exact Match scores are also high.

Usage

Install Requirements

To install requirements:

pip install -r requirements.txt

Setting up the package

We package our code as 'src'. If you're running any script from outside src, you may not need this step, but in case the imports are not resolved directly, you can install the package using:

python setup.py install

Fine-tuning BERT

The fine-tuning requires two configuration file paths, one for the dataset, and one for the trainer.

The default dataset config for SQuAD is as follows:

dataset_name: squad #The dataset to be loaded from src.datasets
model_checkpoint: bert-base-uncased #Pretrained Tokenizer Name
max_length: 384 #Max Sequence Length
doc_stride: 128 #Document Stride

For DuoRC, we need local file paths as it is not available on HuggingFace datasets:

dataset_name: duorc_modified #The dataset to be loaded from src.datasets
model_checkpoint: bert-base-uncased #Pretrained Tokenizer Name
max_length: 384 #Max Sequence Length
doc_stride: 128 #Document Stride
squad_v2: false #Whether to include no answer examples
data_files:
  train: ./data/duorc/dataset/SelfRC_train.json # The path to train dataset JSON.
  validation: ./data/duorc/dataset/SelfRC_dev.json # The path to dev dataset JSON.

An example of train config:

#Args
model:
  pretrained_model_name: bert-base-uncased
args:
  output_dir: "/content/drive/My Drive/MLR/v1_style/squad/ckpts/squad-bert-base-uncased" ## Checkpoint Directory
  logging_dir: "/content/drive/My Drive/MLR/v1_style/squad/runs/" ## Log Directory
  evaluation_strategy: epoch
  per_device_train_batch_size: 6
  per_device_eval_batch_size: 8
  weight_decay: 0.01
  learning_rate: 3e-5
  num_train_epochs: 2
  adam_epsilon: 1e-6
  lr_scheduler_type: polynomial
  warmup_steps: 2950 # 10% of total train steps - (88524*2)/6 * 0.1
  logging_first_step: true
  logging_steps: 1000
  save_steps: 2000
  seed: 2020
  dataloader_num_workers: 4
trainer:
  pretrained_tokenizer_name: bert-base-uncased
  save_model_name: "/content/drive/My Drive/MLR/v1_style/squad/model/squad-bert-base-uncased-model" ## Path for final model.
misc:
  squad_v2: false
  raw_predictions_file: "/content/drive/My Drive/MLR/v1_style/squad/preds/squad_raw" ## Store the binary predictions
  metric_file: "/content/drive/My Drive/MLR/v1_style/squad/preds/squad.json" ## Store the evaluation result
  final_predictions_file: "/content/drive/My Drive/MLR/v1_style/squad/preds/squad_final_predictions.json" ## Store the final processed predictions per example.

If you do not wish to change the file paths, you can fine-tune the BertForQuestionAnswering model using the following commands:

  1. SQuAD v1.1
python train.py --train ./configs/train/squad/default.yaml --dataset ./configs/datasets/squad/default.yaml
  2. DuoRC SelfRC
python train.py --train ./configs/train/duorc_modified/default.yaml --dataset ./configs/datasets/duorc_modified/default.yaml

Running this command saves the processed predictions as a JSON file at the path specified in the trainer configuration, along with the checkpoints, final model, metrics, and logs at their respective paths.

If you have a trained model at save_model_name from the train configuration, you can use --only_predict to get raw and processed predictions.

If you already have the raw predictions file and just want to calculate the metrics, use --load_predictions with the above commands.
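
For example, assuming a model has already been saved at save_model_name using the SQuAD configs above, the flags can be combined with the same command (illustrative invocations):

python train.py --train ./configs/train/squad/default.yaml --dataset ./configs/datasets/squad/default.yaml --only_predict
python train.py --train ./configs/train/squad/default.yaml --dataset ./configs/datasets/squad/default.yaml --load_predictions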

Integrated Gradients

Based on the predictions stored in the JSON file during training, you can calculate Integrated Gradients on a random sample of examples and store the token-wise and word-wise importances as binaries.

For this, a configuration file is needed. An example configuration file looks like:

# Config for Integrated Gradients for SQuAD
model_checkpoint: "/content/drive/My Drive/MLR/v1_style/squad/model/squad-bert-base-uncased-model" ##Model Checkpoint
device: cuda # Device to be used for Integrated Gradients
n_steps: 25 # Number of steps to use for Numerical Approximation
method: "riemann_right" # The method to be used in Captum's Integrated Gradients
internal_batch_size: 4 # The batch size to be used internally
n_samples: 1000 # The number of samples to do IG for
store_dir: "/content/drive/My Drive/MLR/v1_style/squad/IGv2/" # The path where the resulting binaries are stored
predictions_path: "/content/drive/My Drive/MLR/v1_style/squad/preds/squad_final_predictions.json" # The path where the predictions were stored during training.

The terminal command to run Integrated Gradients is:

python run_integrated_gradients.py --config ./configs/integrated_gradients/squad.yaml

This will store the samples (samples), token-wise importances (token_importances), and word-wise importances (word_importances) in binary files at the store_dir.
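
The binaries can then be loaded for further analysis. A minimal sketch, assuming they are standard pickled Python objects (the exact serialization is handled inside run_integrated_gradients.py):

import os
import pickle

store_dir = "/content/drive/My Drive/MLR/v1_style/squad/IGv2/"  # store_dir from the config above

def load_binary(name):
    # Assumption: each binary is a pickled object named exactly as listed above.
    with open(os.path.join(store_dir, name), "rb") as f:
        return pickle.load(f)

samples = load_binary("samples")
token_importances = load_binary("token_importances")
word_importances = load_binary("word_importances")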

Quantifier Integrated Gradients

To run Integrated Gradients for Quantifier Questions, the command is the same as that for Integrated Gradients. The n_samples value is ignored, as we compute Integrated Gradients for all the examples that have quantifier questions.

Running the same command stores the samples (samples), token-wise importances (token_importances), and word-wise importances (word_importances) in binary files at store_dir/quantifier/.

Jensen-Shannon Divergence Heatmaps

To generate JSD Heatmaps, use the following command:

python generate_heatmaps.py --path <path to token importance scores> --name <name used to save> --topk <K important scores to be retained/removed>

This generates heatmaps (JSD_<name>_<topk>_Heatmap_Retained.png, JSD_<name>_<topk>_Heatmap_Removed.png) and binary files (Retained Map <name> <topk>, Removed Map <name> <topk>) containing the layer-wise JSD for all samples. In case you already have the binary files, you can use the --load_binary option to avoid recalculating the JSD.
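
For instance, the SQuAD heatmaps committed under images/ (JSD_SQuAD_2_Heatmap_Retained.png and JSD_SQuAD_2_Heatmap_Removed.png) correspond to an invocation along these lines, where the token importance path is illustrative:

python generate_heatmaps.py --path "/content/drive/My Drive/MLR/v1_style/squad/IGv2/token_importances" --name SQuAD --topk 2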

Semantic and POS Statistics

To generate Semantic Statistics, use the following command:

python generate_tables.py --path <path to word importance scores> --name <name used to save> --topk <K important scores to be checked> --window <window size to be used to find contextual words>

This generates the tables for Semantic Statistics and Part-of-Speech Statistics as A_Q_C <name> <topk> <window> Table.txt and POS <name> <topk> <window> Table.txt, respectively, in $\LaTeX$ format.

Visualization

To generate visualization for top-K words for a few layers, use the following command:

python generate_viz.py --path <path to word importances> --name <name used to save> --topk <K important words to be considered>

This stores an HTML file named <name>_<seed>_<topk>_viz.html, where <seed> is the random seed used to sample the example.

t-SNE Representation

To generate t-SNE representations for a few layers for SQuAD, use the following command:

python generate_tsne.py --train ./configs/train/squad/default.yaml

This uses the predictions stored during fine-tuning to determine the word categories, and gets layer-wise representations for the best feature.

Running this command stores 4 t-SNE plots in .jpg format.
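
For reference, a minimal sketch of how layer-wise hidden states can be projected with t-SNE using scikit-learn; the hidden-state extraction and plotting details in generate_tsne.py differ, and the example sentence is illustrative:

import torch
from sklearn.manifold import TSNE
from transformers import BertModel, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_hidden_states=True)
model.eval()

inputs = tokenizer("How many players are on the field?",
                   "Each team fields eleven players during a match.",
                   return_tensors="pt")
with torch.no_grad():
    hidden_states = model(**inputs).hidden_states  # tuple: embedding output + 12 layer outputs

layer = 10  # one of the layers shown in the plots (1, 5, 10, 12)
token_vectors = hidden_states[layer].squeeze(0).numpy()  # (seq_len, 768)
coords = TSNE(n_components=2, perplexity=5, random_state=2020).fit_transform(token_vectors)
# `coords` can then be scatter-plotted and colored by word category
# (answer / contextual / query / other), as in the figures below.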

Quantifier Predictions

To calculate EM and confidence scores on quantifier questions, non-quantifier questions, and quantifier questions with more than one numerical word in the passage, use the following command:

python predict_quantifier.py --train ./configs/train/squad/default.yaml --dataset ./configs/datasets/squad/default.yaml

This takes the same dataset and train configurations as train.py, and uses both the dataset and the predictions stored in the JSON file.

The confidence scores are printed on the console, while the evaluation metric scores are stored in JSON files.

Quantifier Numerical Statistics

To generate tables of the percentage of numerical words in the top-k words, out of all numerical words in the passage, for quantifier questions, use the following command:

python generate_quantifier_tables.py --path <path to word importances> --name <name used to save> --topk <K important scores to be checked>

This command stores the results in a file named <name> <topk> Quantifier Table.txt in $\LaTeX$ format.

Adding a New Dataset

If you wish to add a new dataset, you can simply extend the DuoRC dataset class (or make a base class from it) and write your own convert_to_squad_format method for your dataset, along with a corresponding configuration.

Additionally, use our configmapper object to map the custom dataset to our registry, add the dataset to __init__.py in src/datasets, and finally import it in train.py.

Once this is done, you should be able to use train.py easily on your dataset without much modification.
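
A rough sketch of what such a dataset class might look like; the exact base-class and configmapper APIs live in src/datasets/duorc.py and src/utils/mapper.py, and the class name, decorator usage, and field names below are assumptions for illustration only:

# src/datasets/my_dataset.py (hypothetical)
from src.datasets.duorc import DuoRC        # assumed import path and class name for the base class
from src.utils.mapper import configmapper   # the registry object mentioned above

@configmapper.map("datasets", "my_dataset")  # assumed registration call
class MyDataset(DuoRC):
    def convert_to_squad_format(self, example):
        # Map your raw fields onto SQuAD-style keys; the keys on the
        # right-hand side are placeholders for your dataset's schema.
        answer = example["answer_text"]
        start = example["passage"].find(answer)
        return {
            "id": example["qa_id"],
            "context": example["passage"],
            "question": example["question"],
            "answers": {"text": [answer], "answer_start": [start]},
        }

Remember to also add the class to __init__.py in src/datasets, import it in train.py, and set dataset_name: my_dataset in your dataset configuration.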

Directory Structure

.
├── configs
│   ├── datasets
│   │   ├── duorc
│   │   │   ├── default.yaml
│   │   │   └── squad_v2.yaml
│   │   ├── duorc_modified
│   │   │   └── default.yaml
│   │   └── squad
│   │       ├── default.yaml
│   │       └── squad_v2.yaml
│   ├── integrated_gradients
│   │   ├── duorc.yaml
│   │   └── squad.yaml
│   └── train
│       ├── duorc
│       │   ├── default.yaml
│       │   └── squad_v2.yaml
│       ├── duorc_modified
│       │   └── default.yaml
│       └── squad
│           ├── default.yaml
│           └── squad_v2.yaml
├── data
│   ├── duorc
│   │   ├── dataset
│   │   │   ├── ParaphraseRC_dev.json
│   │   │   ├── ParaphraseRC_test.json
│   │   │   ├── ParaphraseRC_train.json
│   │   │   ├── SelfRC_dev.json
│   │   │   └── SelfRC_train.json
├── generate_heatmaps.py
├── generate_quantifier_tables.py
├── generate_tables.py
├── generate_tsne.py
├── generate_viz.py
├── html
│   ├── DuoRC_79828_5_viz.html
│   ├── DuoRC_998016_5_viz.html
│   ├── SQuAD_111386_5_viz.html
│   └── SQuAD_766771_5_viz.html
├── images
│   ├── JSD_DuoRC_2_Heatmap_Removed.png
│   ├── JSD_DuoRC_2_Heatmap_Retained.png
│   ├── JSD_SQuAD_2_Heatmap_Removed.png
│   ├── JSD_SQuAD_2_Heatmap_Retained.png
│   ├── tSNE_10_10.jpg
│   ├── tSNE_10_12.jpg
│   ├── tSNE_10_1.jpg
│   └── tSNE_10_5.jpg
├── predict_quantifier.py
├── README.md
├── requirements.txt
├── run_integrated_gradients.py
├── run_quantifier_ig.py
├── setup.py
├── src
│   ├── datasets
│   │   ├── duorc_modified.py
│   │   ├── duorc.py
│   │   ├── __init__.py
│   │   └── squad.py
│   ├── __init__.py
│   └── utils
│       ├── __init__.py
│       ├── integrated_gradients.py
│       ├── mapper.py
│       ├── misc.py
│       ├── postprocess.py
│       └── viz.py
└── train.py

Pre-trained Models and Results

We will update the pre-trained models and results post-review, as the pre-trained checkpoints are large and stored on Google Drive.

Implementation

The paper uses the original BERT script to train and evaluate the model on both datasets, while we use custom scripts based on the HuggingFace datasets and transformers libraries.

Some salient differences between the two implementations:

Differences with the original BERT SQuAD script

  • Original script: Uses a max query length of 64 to decide the number of query tokens used.
    Our implementation: Doesn't consider a max query length, as we feel the full question is needed. However, we will add a max query length option in the configuration soon.
  • Original script: The doc stride is based on a sliding-window approach.
    Our implementation: The doc stride works on an overlap-based approach, i.e. the stride is the maximum overlap two features can have for an example.
  • Original script: Keeps track of the max_context features for the tokens using score = min(num_left_context, num_right_context) + 0.01 * doc_span.length, so that start indices can be filtered based on this score.
    Our implementation: We don't use max_context features yet.
  • Original script: Uses a function to align the predictions after training. This function cleans the predicted answer of accents, tokenizes on punctuation, and joins the original text. The answer is then stripped of spaces and compared and aligned with the original text in the context and prediction.
    Our implementation: We don't use any function to clean the predictions after training, which can significantly affect EM/F1 scores.
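
As a rough illustration of the overlap-based striding, here is a minimal sketch using the HuggingFace tokenizer with the same max_length and doc_stride as our configs; the example strings are placeholders:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

question = "Who wrote the play?"
context = "..."  # a long passage that overflows max_length

# With return_overflowing_tokens, a long example is split into several
# features; `stride` is the number of tokens that overlap between
# consecutive features, matching our doc_stride: 128 setting.
encoded = tokenizer(
    question,
    context,
    truncation="only_second",  # only truncate the context, never the question
    max_length=384,
    stride=128,
    return_overflowing_tokens=True,
    return_offsets_mapping=True,
    padding="max_length",
)

print(len(encoded["input_ids"]))  # number of features created for this example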

Datasets and Training

We use the HuggingFace datasets and transformers libraries for training and evaluation of the models. We build our own dataset classes, using the Dataset and DatasetDict classes internally.

We have added the required datasets - SQuAD v1.1 and DuoRC - by adding their official repositories (SQuAD, DuoRC) as submodules under the data directory, although we only use the DuoRC files in our code. For SQuAD v1.1, we use HuggingFace Datasets' squad directly, but fine-tune bert-base-uncased on it from scratch using the same parameters as the original BERT script.

The DuoRC dataset has to be converted to SQuAD format before it can be used with any pre-trained model from HuggingFace.

On its own, DuoRC SelfRC has 30% of questions without any answer in the context, i.e. the answers are expected to be generated. SQuAD v1.1, on the other hand, has all answers in the given passage, and the answer start index and text are provided.

But for this paper, the DuoRC dataset has been converted to SQuAD format (with a start index and answer text). Our conversion isn't exactly the same as the authors'. The authors rely on Google Research's original script for choosing the examples, while the choices we make when converting DuoRC to SQuAD format are described below.
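
A minimal sketch of the kind of conversion involved, assuming a DuoRC-style entry with a plot, a question, and a list of answer strings (the helper and its field names are illustrative, not our exact convert_to_squad_format):

def duorc_qa_to_squad(plot, question, answers, qa_id):
    """Convert one DuoRC question into a SQuAD-style record.

    Returns None when no answer string can be located verbatim in the plot,
    mirroring the 'Drop' choices in the tables below.
    """
    for answer in answers:
        start = plot.find(answer)
        if start != -1:  # answer text found verbatim in the plot
            return {
                "id": qa_id,
                "context": plot,
                "question": question,
                "answers": {"text": [answer], "answer_start": [start]},
            }
    return None  # answer exists but was not found in the plot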

SQuAD

Processing SQuAD was relatively easy compared to DuoRC, as the pre-trained models from HuggingFace are built around the SQuAD format. We directly use the dataset provided by HuggingFace and use their tokenizers to return tokenized datasets for training, validation and prediction. We achieved an F1 score of 88.51 on SQuAD.

DuoRC - Variant 1

We trained one of our models on DuoRC processed into SQuAD v1.1 format using the choices below. This gave a very high score, since many examples were also dropped from the validation set. We discarded this model for the subsequent steps.

Example - 'Train S2' means that the dataset is Train and the format chosen for processing is SQuAD v2.0.

                                       Train S1.1   Train S2     Dev S1.1   Dev S2
No Answer                              Drop         Keep         Drop       Keep
Single Answer                          Keep         Keep         Keep       Keep
Multiple Answers                       Keep First   Keep First   Keep All   Keep All
Answer exists but not found in plot    Drop         Keep         Drop       Keep

DuoRC Modified - Variant 2

Here, we keep the no-answer cases as empty answers in all training and validation sets, regardless of SQuAD v1.1 or SQuAD v2 style, and process the examples into SQuAD v1.1 format using the choices below. We do so in order to bring the model's F1 scores closer to those reported in the paper, and because the authors said they didn't drop any examples from the validation set during prediction. We achieved an F1 score of 50.73 on this version of the dataset. Our analysis/results are based on this form of processing.

                                       Train S1.1   Train S2     Dev S1.1   Dev S2
No Answer                              Keep         Keep         Keep       Keep
Single Answer                          Keep         Keep         Keep       Keep
Multiple Answers                       Keep First   Keep First   Keep All   Keep All
Answer exists but not found in plot    Drop         Keep         Keep       Keep

Integrated Gradients

The authors use a custom implementation of Integrated Gradients, with m_steps = 50, over all the examples in SQuAD and DuoRC. We implement Integrated Gradients (IG) using the PyTorch-based library Captum.

We calculate Integrated Gradients on each layer's input states using the riemann_right numerical approximation. We calculate the attributions of the layers with respect to the maximum of the softmax of the start and end logits separately, with m_steps = 25. Due to computational restrictions, we had to reduce the number of samples (1000) and the number of steps we could calculate Integrated Gradients on.

Additionally, when computing the importance values, we consider only the best feature for each example, as predicted by the model. We take the norm of the generated token attributions and normalize them to get a probability distribution. We use this probability distribution to calculate word-wise importances by adding together the importances of each word's tokens and re-normalizing the scores.
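
A minimal sketch of this setup with Captum; the model checkpoint, the example inputs, and attributing only the start position are illustrative, and the full pipeline lives in src/utils/integrated_gradients.py:

import torch
from captum.attr import LayerIntegratedGradients
from transformers import BertForQuestionAnswering, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForQuestionAnswering.from_pretrained("bert-base-uncased")  # illustrative checkpoint
model.eval()

def start_confidence(input_ids, attention_mask):
    # Maximum of the softmax over the start logits (the end logits are
    # handled the same way in a separate pass).
    out = model(input_ids=input_ids, attention_mask=attention_mask)
    return torch.softmax(out.start_logits, dim=-1).max(dim=-1).values

inputs = tokenizer("How many players are on the field?",
                   "Each team fields eleven players during a match.",
                   return_tensors="pt", max_length=384,
                   padding="max_length", truncation="only_second")
baseline_ids = torch.full_like(inputs["input_ids"], tokenizer.pad_token_id)

# Attribute the embedding layer's output; swap in model.bert.encoder.layer[i]
# to probe a specific encoder layer instead.
lig = LayerIntegratedGradients(start_confidence, model.bert.embeddings)
attributions = lig.attribute(
    inputs=inputs["input_ids"],
    baselines=baseline_ids,
    additional_forward_args=(inputs["attention_mask"],),
    n_steps=25,
    method="riemann_right",
    internal_batch_size=4,
)

# Token-wise importances: norm over the hidden dimension, then normalize
# into a probability distribution over the sequence.
token_importances = attributions.norm(dim=-1).squeeze(0)
token_importances = token_importances / token_importances.sum()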

Note: In other IG variants, targets can be either:

  • argmax(softmax(logits)) for start and end.

  • best start and end logits based on max(softmax(start_logits)+softmax(end_logits)).

  • ground truth start and end.

Jensen Shannon Divergence

The authors calculate Jensen-Shannon Divergence (JSD) using the dit library. For 1000 examples, they retain the top-2 token importances, zero out the rest, and plot a heatmap of inter-layer divergence values. They then remove the top-2 token importances, keep the rest, and plot the heatmap again. They observe that the heatmap with the top-k importances retained has a larger gap between its maximum and minimum values, meaning layers focus on different words, while for the top-k removed case the distribution is almost uniform.

We repeated this with 1000 features instead of examples and observed similar heatmaps.

The essence of this analysis is to look at the gap between the maximum and minimum values in the heatmap, and to see which pairs of layers have similar top-k importances and which pairs have different ones.
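
A minimal sketch of the retained/removed comparison for one pair of layer distributions; we use scipy here for brevity (the authors used dit), and the distributions are placeholders:

import numpy as np
from scipy.spatial.distance import jensenshannon  # returns the JS distance (sqrt of the divergence)

def top_k_retained(p, k=2):
    """Zero out everything except the top-k importances and re-normalize."""
    q = np.zeros_like(p)
    idx = np.argsort(p)[-k:]
    q[idx] = p[idx]
    return q / q.sum()

def top_k_removed(p, k=2):
    """Zero out the top-k importances, keep the rest, and re-normalize."""
    q = p.copy()
    q[np.argsort(p)[-k:]] = 0.0
    return q / q.sum()

# layer_a, layer_b: token-importance distributions of length 384 for one feature
layer_a = np.random.dirichlet(np.ones(384))  # placeholder distributions
layer_b = np.random.dirichlet(np.ones(384))

jsd_retained = jensenshannon(top_k_retained(layer_a), top_k_retained(layer_b), base=2) ** 2
jsd_removed = jensenshannon(top_k_removed(layer_a), top_k_removed(layer_b), base=2) ** 2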

JSD SQuAD Heatmap Retained for K=2

JSD SQuAD Heatmap Removed for K=2

JSD DuoRC Heatmap Retained for K=2

JSD DuoRC Heatmap Removed for K=2

QA Functionality

Based on the word importance scores, we calculate the average percentage of answer words, contextual words and query words in the top-5 important words for each layer over 1000 samples.
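
A minimal sketch of how a single example's top-5 words can be categorized; stopword handling follows the authors' answers in the QnA section below, and the helper and its inputs are illustrative:

import nltk
nltk.download("stopwords", quiet=True)
from nltk.corpus import stopwords

STOPWORDS = set(stopwords.words("english"))

def categorize_top_words(words, importances, answer_span, query_words, k=5, window=5):
    """Count answer, contextual and query words among the top-k important words.

    words:       passage words for one example
    importances: word-wise importance scores, same length as words
    answer_span: (start_idx, end_idx) of the predicted answer in `words`
    query_words: words of the question
    """
    top_idx = sorted(range(len(words)), key=lambda i: importances[i], reverse=True)[:k]
    ans_start, ans_end = answer_span
    context_idx = set(range(max(0, ans_start - window), min(len(words), ans_end + window + 1)))
    query = {w.lower() for w in query_words} - STOPWORDS  # ignore stopwords for query matches

    counts = {"answer": 0, "contextual": 0, "query": 0}
    for i in top_idx:
        if ans_start <= i <= ans_end:
            counts["answer"] += 1
        elif i in context_idx:
            counts["contextual"] += 1
        if words[i].lower() in query:
            counts["query"] += 1  # a word may count as both query and contextual
    return counts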

The results are shown below:

Semantic Statistics For SQuAD

Layer Name Answer Words Contextual Words Q-Words
Embedding 38.10 32.96 22.46
Layer 1 37.58 33.04 22.20
Layer 2 37.10 33.58 24.08
Layer 3 41.00 33.10 19.62
Layer 4 40.42 36.40 16.34
Layer 5 40.82 34.68 18.58
Layer 6 40.74 36.46 15.62
Layer 7 40.06 35.76 14.12
Layer 8 41.90 34.94 11.38
Layer 9 41.18 36.12 11.66
Layer 10 43.36 35.40 9.74
Layer 11 42.52 32.14 10.30
Layer 12 42.94 34.02 10.42

Semantic Statistics For DuoRC

Layer Name Answer Words Contextual Words Q-Words
Embedding 11.78 9.36 24.00
Layer 1 11.70 12.00 19.20
Layer 2 12.60 11.84 17.54
Layer 3 13.36 11.96 16.18
Layer 4 13.16 12.64 20.30
Layer 5 12.68 11.24 22.02
Layer 6 12.96 11.72 15.72
Layer 7 12.68 11.90 12.86
Layer 8 13.36 12.22 8.24
Layer 9 12.66 12.78 5.50
Layer 10 12.90 11.12 6.74
Layer 11 13.06 11.86 7.52
Layer 12 12.94 11.78 8.68

Qualitative Examples

Visualization of top-5 words in SQuAD and DuoRC:

Example 1 for SQuAD : SQuAD Example 1

Example 2 for SQuAD : SQuAD Example 2

Example 1 for DuoRC : DuoRC Example 1

Example 2 for DuoRC : DuoRC Example 2

Quantifier Questions

We calculate the percentage of numerical words in the top-5 words, out of all numerical words in the passage, for quantifier questions using NLTK's POS tagger.
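
A minimal sketch of the numerical-word detection; quantifier questions are identified by the phrases 'how many' / 'how much', and the example strings are illustrative:

import nltk

nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

def is_quantifier_question(question):
    q = question.lower()
    return "how many" in q or "how much" in q

def numerical_words(text):
    """Return words tagged as cardinal numbers (POS tag 'CD') by NLTK."""
    tokens = nltk.word_tokenize(text)
    return [word for word, tag in nltk.pos_tag(tokens) if tag == "CD"]

passage = "Each team fields eleven players, and a match lasts 90 minutes."
print(is_quantifier_question("How many players are on the field?"))  # True
print(numerical_words(passage))  # e.g. ['eleven', '90']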

SQuAD

Layer   % Numerical Words in Top-5   % Numerical Words in Top-5 out of All Numerical Words in Passage
Embedding 6.06 6.560809
Layer 1 6.24 6.826815
Layer 2 6.50 7.183304
Layer 3 6.80 7.415591
Layer 4 7.02 7.574549
Layer 5 7.50 7.731585
Layer 6 7.54 7.745931
Layer 7 8.02 8.496150
Layer 8 8.64 8.720650
Layer 9 9.12 9.489101
Layer 10 8.18 8.575832
Layer 11 9.44 9.447789
Layer 12 9.48 9.931489

DuoRC

Layer   % Numerical Words in Top-5   % Numerical Words in Top-5 out of All Numerical Words in Passage
Embedding 23.677419 35.009286
Layer 1 24.129032 36.212540
Layer 2 24.709677 35.945527
Layer 3 26.838710 37.727680
Layer 4 26.322581 37.065249
Layer 5 25.290323 36.272995
Layer 6 27.806452 39.428653
Layer 7 29.096774 40.701450
Layer 8 41.354839 53.678327
Layer 9 31.806452 43.113043
Layer 10 35.677419 46.667202
Layer 11 40.000000 51.984534
Layer 12 40.967742 53.044442

t-SNE Results

The t-SNE Results we generated for a SQuAD example are shown below:

t-SNE plots for Layer 1, Layer 5, Layer 10, and Layer 12

QnA with the authors

  1. BertTokenizer usually breaks a word into multiple tokens due to WordPiece Embeddings. In that case, for some words there will be multiple vectors for each layer. A simple way to combine these would be to average them for a word. How was this handled in the implementation?

    • We keep the embeddings for the different segments of a word separate, and calculate separate integrated gradient scores for them, which we then normalize to get importance scores. Later, we add up the importance scores of all these segments to get the overall importance score of the word. [The segments can be identified by searching for a "##" symbol in the word - this can be checked and confirmed by printing out the passage words].
  2. BERT Base has a max sequence length of 512 tokens. For DuoRC SelfRC, the max number of tokens per passage in train is 3467, with a mean of 660. Similarly, for SQuAD v1.1 the max length is 853, with a mean of 152. For each of these, is the max length set to 512? If that is done, then is only the article/passage/context truncated? If yes, how?

    • We maintain the max length of 384 tokens in both SQuAD and DuoRC in our experiments.
  3. For DuoRC, there are cases where there are multiple answers to a question, and a good number of cases where the answer is not present in the passage. What is done regarding these cases in the implementation? Example of multiple answers:

    ['to own a hotel', 'to own his own hotel']
    ['Tijuana, Mexico', 'Tiajuana']
    ['Tessa.', 'Tessa', 'Tessa']
    
    • We use the available tensorflow implementation of BERT, which handles multiple answers by itself. Multiple answers are seen in SQuAD as well as DuoRC.
  4. How did you find numerical words/tokens in the passage/question and quantifier questions? I checked a library called word2number but it only works for number spans, and only when it is an exact number word. I couldn't find any popular approaches.

    • We use NLTK POS tagging on the passage words, and the words which have the POS tag of 'CD' (cardinal) are taken to be the quantifier words. On the question side, we take questions which have the words "how many" or "how much" as quantifier questions.
  5. What is the base used for Jensen-Shannon Divergence? The units or log base.

    • We use the implementation of jensen_shannon_divergence from the library dit.divergences . Please check the documentation, I am unable to recollect the exact details now. "from dit.divergences import jensen_shannon_divergence"
  6. How was the contextual passage for t-SNE visualization decided? Was this supposed to be the whole sentence that contains the answer "span"?

    • We chose words within a distance of 5 words on either side of the answer span as contextual words for tables. The whole sentence was chosen for t-SNE visualization.
  7. What were the other training/fine-tuning choices made, with respect to hyperparameters, optimizers, schedulers, etc.?

    • We used the default config given in BERT's official code. However, we changed the batch size to 6 to fit our GPU.
  8. What is EM in 87.35% EM? (mentioned in Section 5.2 in the Quantifier Questions subsection)

    • To measure the performance of the model on answer spans, both SQuAD and DuoRC use the same code - with 2 metrics : F1 score and Exact Match (EM). The 87.35% EM refers to the exact match score.
  9. The paper mentions that all answer spans are in the passage. While that is true for SQuAD, DuoRC has answers not present in the passage. Did you remove such cases?

    • Yes, we remove train cases where the answer is not in the passage (this is done by the BERT code itself). However, we do not remove any data points from the dev set.
  10. I have another doubt regarding t-SNE representations. For multi-token words, do you take the average of all those token representations as the word representation while plotting?

    • tSNE was a qualitative analysis, and for the examples we picked, we didn't observe segmentation of words. If you're analyzing examples with segmentation, I guess you could try either merging the embeddings or keeping the different segments separate.
  11. When calculating Integrated Gradients, for start and end there will be different attribution values for each word representation (because we have two outputs for each input), how was it combined when calculating the IG for each word?

    • We calculate the IG of both the start probabilities and the end probabilities with respect to the word embedding at hand, and then add them up.
  12. I store tokens, token_wise_importances, words, and word_wise_importances (after removing special tokens and combining at ##) The JSD was built on token wise distributions or word wise distributions?

    • JSD was computed on the token-wise importances (length 384); the tables use the word-wise importances.
  13. Should the IG be calculated on ground targets or predicted outcomes?

    • We calculated the attributions of what the model has predicted, rather than what it should have predicted. We followed another attention analysis based paper for this logic.
  14. What if a particular word is a query word and also in the contextual span (within window size 5 of the answer)?

    • I just consider them twice then.. if a word was both a query word and a contextual word, it probably would have served dual functionality in the training as well, I guess.
    • While finding query words, remove the stopwords from the search.
  15. Should the stopwords be removed for query as well as contextual words? Should the window size be applied after removing stopwords or before? Should the top-5 words contain stopwords?

    • Keep them for contextual words, because they are actually part of the context. But when finding question words in the passage, ignore the stopwords in the question, because you'll probably find many "is" or "the" etc. in the passage and they needn't all correspond to the query.

    • Should top-5 words include stopwords? - here it's okay

  16. The answer spans/answers in the analysis are actual answers right?

    • again, we chose the answer which the model predicted, not the actual answer span (same logic as used for IG).
  17. Did you take predicted answers for tables and t-SNE as well?

    • I used predicted (and processed) answers for all the analysis after training.
