Documentation bug in tokenized input example #31

Open
MarcelloPerathoner opened this issue Feb 2, 2016 · 0 comments
This example from the documentation:

{
  "witnesses" : [
    {
      "id" : "A",
      "tokens" : [
          { "t" : "A", "ref" : 123 },
          { "t" : "black" , "adj" : true },
          { "t" : "cat", "id" : "xyz" }
      ]
    },
    {
      "id" : "B",
      "tokens" : [
          { "t" : "A" },
          { "t" : "white" , "adj" : true },
          { "t" : "kitten.", "n" : "cat" }
      ]
    }
  ]
}

is misleading because each token's "t" value should include trailing whitespace where appropriate. The built-in tokenizer includes that whitespace by default, so pre-tokenized input should match it. The example should also show the normalized "n" value without whitespace, so that the token comparators are not fooled.
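A minimal sketch of how pre-tokenized input of that shape could be produced. The `tokenize` helper is hypothetical and is not the project's built-in tokenizer; it only assumes that "t" should keep trailing whitespace and that "n" should drop it:

```python
import json
import re


def tokenize(text):
    """Split text into tokens: "t" keeps trailing whitespace, "n" drops it."""
    # Each match is a run of non-whitespace plus any whitespace that follows it.
    return [
        {"t": match.group(0), "n": match.group(0).strip()}
        for match in re.finditer(r"\S+\s*", text)
    ]


# Hypothetical witness built from plain text rather than hand-written tokens.
witness_a = {"id": "A", "tokens": tokenize("A black cat")}
print(json.dumps(witness_a, indent=2))
```

Concatenating the "t" values of the result gives back the original text, which is the property the documented example lacks.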

Why is this important? Because if you omit the whitespace, the segment-joining phase runs the tokens together, like this:

digraph G {
  v0 [label = ""];
  v1 [label = "Ablackkitten."];
  v2 [label = ""];
  v0 -> v1 [label = "A, B"];
  v1 -> v2 [label = "A, B"];
  v0 -> v2 [color =  "white"];
}

N.B. This output was generated from the following input, slightly modified to exercise the segment joiner:

{
  "witnesses" : [
    {
      "id" : "A",
      "tokens" : [
          { "t" : "A", "ref" : 123 },
          { "t" : "black" , "adj" : true },
          { "t" : "cat", "id" : "xyz" }
      ]
    },
    {
      "id" : "B",
      "tokens" : [
          { "t" : "A" },
          { "t" : "black" , "adj" : true },
          { "t" : "kitten.", "n" : "cat" }
      ]
    }
  ]
}
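The effect can be reproduced in isolation. This sketch assumes that joining a segment amounts to concatenating the "t" values of adjacent tokens, which matches the "Ablackkitten." label in the graph output but may not be how the joiner is actually implemented; `join_segment` is a hypothetical name:

```python
def join_segment(tokens):
    """Concatenate the "t" values of adjacent tokens (assumed joiner behavior)."""
    return "".join(token["t"] for token in tokens)


# Witness B from the modified input, as documented (no trailing whitespace).
without_space = [
    {"t": "A"},
    {"t": "black", "adj": True},
    {"t": "kitten.", "n": "cat"},
]

# The same tokens with trailing whitespace in "t" and whitespace-free "n".
with_space = [
    {"t": "A ", "n": "A"},
    {"t": "black ", "n": "black", "adj": True},
    {"t": "kitten.", "n": "cat"},
]

print(join_segment(without_space))  # Ablackkitten.
print(join_segment(with_space))     # A black kitten.
```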