Using pre-tokenized queries / documents does not work at the moment #50

Open
mam10eks opened this issue Jul 30, 2024 · 5 comments

@mam10eks
Member

This commit adds some failing unit tests: 4a747d4

This should be simple to resolve. We load the term pipeline from the Terrier index, and we implemented that before the pre-tokenized feature was available in PyTerrier, so we likely apply the wrong pipeline when pre-tokenized input is specified.

@mam10eks mam10eks self-assigned this Jul 30, 2024
@mam10eks
Member Author

cc @Parry-Parry, @heinrichreimer.

@mam10eks
Member Author

Alright, for pre-tokenized indexes, the index/data.properties file contains an empty termpipelines= entry, and in this case ir_axioms falls back to a default term pipeline that applies some normalization.
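To make the empty-versus-missing distinction concrete, here is a minimal sketch of reading the termpipelines entry from properties-style text. The helper name read_term_pipelines and the sample file contents are assumptions for illustration, not code from ir_axioms or PyTerrier.

```python
from typing import List, Optional


def read_term_pipelines(properties_text: str) -> Optional[List[str]]:
    """Return the configured pipeline stages.

    [] means the key is present but empty (as reported for pre-tokenized
    indexes); None means the key is missing altogether.
    """
    for line in properties_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        key, sep, value = line.partition("=")
        if sep and key.strip() == "termpipelines":
            return [stage.strip() for stage in value.split(",") if stage.strip()]
    return None


# Illustrative file contents (assumed, not copied from a real index).
pretokenized_props = "# pre-tokenized index\ntermpipelines=\n"
default_props = "termpipelines=Stopwords,PorterStemmer\n"

print(read_term_pipelines(pretokenized_props))
print(read_term_pipelines(default_props))
```

With a helper like this, an empty list could be treated as "apply no pipeline" rather than as "fall back to the default normalization", while a missing key could keep the previous behaviour.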

@heinrichreimer Do you have a preference for how we could solve this, e.g., so that it is usable but still compatible with the previous behaviour?

@Parry-Parry

@heinrichreimer @mam10eks So I assume the default pipeline is stopword removal followed by the Porter stemmer. This is always included in data.properties, so it shouldn't be an issue in the default case.

@mam10eks
Member Author

One possible suggestion could also be to introduce a new PreTokenizedTerrierIndexContext that is a TerrierIndexContext and just overrides the term pipeline property?
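A rough sketch of that subclassing idea, under heavy assumptions: both classes below are hypothetical stand-ins (the real TerrierIndexContext API is not shown in this thread), and the "normalization" is reduced to lowercasing so the example stays self-contained.

```python
from typing import Callable, Sequence


class TerrierIndexContextSketch:
    """Hypothetical stand-in for the existing context, not the real class."""

    @property
    def term_pipeline(self) -> Callable[[str], Sequence[str]]:
        # Stand-in for the default normalization: lowercase, then split.
        return lambda text: text.lower().split()

    def terms(self, text: str) -> Sequence[str]:
        return self.term_pipeline(text)


class PreTokenizedTerrierIndexContextSketch(TerrierIndexContextSketch):
    """Same context, but the pipeline leaves tokens untouched."""

    @property
    def term_pipeline(self) -> Callable[[str], Sequence[str]]:
        # Pre-tokenized input: only split on whitespace, no normalization.
        return lambda text: text.split()


default_terms = TerrierIndexContextSketch().terms("Foo BAR")
pretokenized_terms = PreTokenizedTerrierIndexContextSketch().terms("Foo BAR")
```

Because only the property is overridden, existing callers of the base class would keep their current behaviour, which is the compatibility concern raised earlier in the thread.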

@janheinrichmerker
Collaborator

I'd say it would be best to fix this in the PyTerrier backend here:

def terms(
        self,
        query_or_document: Union[Query, Document]
) -> Sequence[str]:
    text = self.contents(query_or_document)
    return self._terms(text)

Is there a PyTerrier API to access the pre-tokenized terms given the document ID?
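Independent of which PyTerrier API ends up providing the tokens, a fix inside terms could look roughly like the sketch below. It assumes the pre-tokenized terms are available as a term-to-frequency mapping under a field named toks; that field name, the record layout, and the function signature are assumptions for illustration, not the actual ir_axioms code.

```python
from collections.abc import Mapping
from typing import Callable, List, Sequence


def terms(
        record: Mapping,
        tokenize: Callable[[str], Sequence[str]],
        toks_field: str = "toks",
) -> List[str]:
    """Prefer pre-tokenized terms; otherwise tokenize the raw text."""
    toks = record.get(toks_field)
    if isinstance(toks, Mapping):
        # Expand {term: frequency} into a flat list, bypassing the pipeline.
        return [term for term, freq in toks.items() for _ in range(int(freq))]
    # No pre-tokenized field: fall back to the regular term pipeline.
    return list(tokenize(str(record["text"])))


pretokenized_doc = {"docno": "d1", "toks": {"a": 2, "b": 1}}
raw_doc = {"docno": "d2", "text": "x y"}
```

Note that a frequency mapping does not preserve the original token order, so anything that depends on term positions would need a different source for the tokens.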
