Skip to content

Similar subtree matching notes

Gunther Cox edited this page Nov 21, 2016 · 4 revisions

Goal: Given a list of strings, find the most similar matching sequence in a graph data structure.

  • The found sequence may have additional strings or strings that are not in the one being searched for.
  • An optimal solution will have the greatest number of strings that have a close match in the original sequence, and the fewest number of extraneous strings added.
  • The last statement in the best matching sequence that gets selected should have possible responses to choose from.

Ideas

  1. Convert text to stemmed text
  2. Compare stemmed strings

A "smarter" comparison algorithm could equate similar sentences by their actual meaning. For example, sentences such as:

Have a free sample of this frozen dairy beverage. Try a sample of ice cream at no charge.

These two statements have very different text but the same meaning.

  1. Get every possible ordered combination of the sequence. This handles the case that an item (or several items) exist in the sequence but do not exist in the graph. We will want to find a match that has the greatest number of matching sequences.
  2. Sort the list of possible ordered combinations by length (longest first). Later, we will be able to stop checking for a match if we find a match that has fewer items in the sequence and a lower comparison score than the current one.

Object model ideas

  • Communication: A connection that represents the response relationship between two statemets and holds other meta data.

    • speaker: F-key Persona
    • date: datetime
    • statement: F-key Statement
    • in_response_to: F-key Statement - The parent node
  • Conversation

    • statements: A ordered list of Statement objects representing a conversation
  • Interaction

    • input
      • meta data: {persona, statements, datetime, ...}
    • output
      • meta data: {...}

References

  • See Algorithms book, page 380
    • Wildcard between each statement?
  • Think of a sequence of strings in the graph as one long string?
  • Tries data structure