The current IMDB dataset contains very few examples of the actual VADER lexicon tokens. As such, let's create two new datasets that have high overlap with the VADER lexicon:
1. A simple version that picks sentences from OpenWebText with high overlap with the VADER lexicon.
2. A "poisoned" version that flips the reward of 30 of the VADER tokens. This will give us a baseline to see whether our IRMs can recover these tokens.
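The poisoning step could be sketched as follows. This is a minimal sketch, assuming the lexicon is held as a dict mapping token to sentiment score (as in vaderSentiment's `vader_lexicon`); the function name and signature are hypothetical.

```python
import random

# Hedged sketch: "poison" a VADER-style lexicon by flipping the sign of the
# reward for a fixed number of tokens. `lexicon` is assumed to be a dict
# mapping token -> sentiment score; 30 tokens would be flipped in practice.
def poison_lexicon(lexicon, n_poisoned=30, seed=0):
    rng = random.Random(seed)  # fixed seed so the poisoned set is reproducible
    poisoned_tokens = rng.sample(sorted(lexicon), n_poisoned)
    poisoned = dict(lexicon)
    for tok in poisoned_tokens:
        poisoned[tok] = -poisoned[tok]  # flip the token's reward
    return poisoned, poisoned_tokens
```

Flipping the sign (rather than zeroing the score) keeps the poisoned tokens informative but wrong, which is exactly the signal an IRM would need to recover.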
The columns of the dataset will be `text`, `lexicon_tokens`, `token_rewards_dict`, and `poisoned`, which is a (usually empty) list of tokens. There will be 30 of these.
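For concreteness, a single row might look like the following. All values here are invented for illustration; only the column names come from the proposal above.

```python
# Hypothetical example row for the proposed schema (values are made up).
row = {
    "text": "The movie was great but the ending felt bad.",
    "lexicon_tokens": ["great", "bad"],                  # lexicon tokens present in the text
    "token_rewards_dict": {"great": 3.1, "bad": -2.5},   # token -> reward
    "poisoned": [],  # non-empty (drawn from the 30 flipped tokens) only in the poisoned dataset
}
```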
The VADER lexicon tokens will be ordered by their frequency in English, and the top 4000 will be picked, with 5 occurrences each.
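The selection step could be sketched as below, under assumptions: `en_freq` is a placeholder for any token-to-English-frequency mapping (e.g. from a unigram count list), and `sentences` stands in for an iterable of candidate OpenWebText sentences; neither name refers to a real API.

```python
from collections import defaultdict

def select_tokens(vader_lexicon, en_freq, k=4000):
    # Rank lexicon tokens by their frequency in English and keep the top k.
    return sorted(vader_lexicon, key=lambda t: en_freq.get(t, 0.0), reverse=True)[:k]

def collect_sentences(sentences, tokens, per_token=5):
    # Greedily keep sentences until every selected token has per_token occurrences.
    counts = defaultdict(int)
    wanted = set(tokens)
    kept = []
    for sent in sentences:
        hits = [w for w in sent.lower().split() if w in wanted and counts[w] < per_token]
        if hits:
            kept.append(sent)
            for w in hits:
                counts[w] += 1
        if all(counts[t] >= per_token for t in wanted):
            break
    return kept
```

A naive whitespace split is used for tokenization here; the real pipeline would want whatever tokenizer the IRM training uses, so that `lexicon_tokens` line up with model tokens.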