WEAVIATE SCHEMA

"A schema is used to define the concepts of the data you will be adding to Weaviate."

In a schema we specify which classes exist, which properties each class has, and, optionally, how the classes relate to each other.

Examples of how to add a schema:


CLASSES

This project uses the CNBC dataset, which is modeled with two classes:

  • Article
  • Author

PROPERTIES

Each class has its own set of properties:

  • Article has:
    • title
    • url
    • publish date
    • author
    • two variants of the description
    • set of keywords
  • Author has:
    • name

REFERENCES

These two classes reference each other. Article has a property "hasAuthors", which stores a link to its author, and Author has a property "hasArticles", which links to the articles that author wrote.


DATA TYPES

Each property has its own data type. The available types are listed in the Weaviate documentation. A property that contains a reference to another class has the referenced class as its data type:

  • hasAuthors data type is Author
  • hasArticles data type is Article
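
As a minimal sketch of what this looks like in a property definition, a reference property puts the name of the target class in "dataType" instead of a primitive type. The helper `reference_property` is hypothetical and exists only for this illustration.

```python
def reference_property(name: str, target_class: str) -> dict:
    """Build a property definition whose data type is another class.

    Hypothetical helper for illustration; it only assembles a dictionary.
    """
    return {"name": name, "dataType": [target_class]}


# Article -> Author and Author -> Article, as described above.
has_authors = reference_property("hasAuthors", "Author")
has_articles = reference_property("hasArticles", "Article")

print(has_authors)
print(has_articles)
```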

VECTORIZATION

With the text2vec-transformers module, Weaviate concatenates all text properties of an object, sends the result to the vectorization module, and uses the returned vector during vector search.

So if you want to exclude a property from vectorization, set the skip parameter in its definition:

"moduleConfig": {
    "text2vec-transformers": {
        "skip": true
    }
}

In the "schema.json" file you can see that vectorization is disabled for:

  • url: a vector of the URL would only add noise
  • short_description: it is basically a shortened version of description, so only description is vectorized
  • keywords: in my opinion they are not helpful, but I might be wrong
  • name of the author: that definitely will not help
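
The list above can be sketched as a small helper that attaches the skip setting to selected property definitions. The helper `with_skip`, the property list, and the data types are assumptions made for this illustration; only the "moduleConfig" fragment itself comes from the schema.

```python
import copy

MODULE = "text2vec-transformers"


def with_skip(prop: dict) -> dict:
    """Return a copy of a property definition with vectorization disabled."""
    out = copy.deepcopy(prop)
    out.setdefault("moduleConfig", {})[MODULE] = {"skip": True}
    return out


# Assumed Article properties; data types are illustrative.
properties = [
    {"name": "title", "dataType": ["text"]},
    {"name": "url", "dataType": ["string"]},
    {"name": "description", "dataType": ["text"]},
    {"name": "short_description", "dataType": ["text"]},
    {"name": "keywords", "dataType": ["string"]},
]

# Article properties excluded from the vector, as listed above.
SKIPPED = {"url", "short_description", "keywords"}
properties = [with_skip(p) if p["name"] in SKIPPED else p for p in properties]

# Names of the properties that still contribute to the vector.
vectorized = [
    p["name"]
    for p in properties
    if not p.get("moduleConfig", {}).get(MODULE, {}).get("skip", False)
]
print(vectorized)
```

The same helper would apply to the "name" property of Author.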

As a reminder: from what I noticed while debugging a custom text2vec-transformers module, Weaviate does not vectorize each text property separately and then take the mean of the vectors. Instead, it concatenates all text properties, vectorizes the resulting string, and attaches that vector to the object (in our case, each article). This vector is then used during vector search, so be careful about which properties you include in the vector representation of a document.
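
The behavior described above can be pictured with a simplified sketch. This is not Weaviate's actual code: the function `text_for_vectorization` is hypothetical, the sample article is made up, and the real module may build the string differently (for example, by also including class or property names).

```python
MODULE = "text2vec-transformers"
TEXT_TYPES = ("text", "string")


def text_for_vectorization(obj: dict, properties: list) -> str:
    """Join the values of all non-skipped text properties into one string.

    Simplified illustration of the concatenation behavior, not the
    real implementation.
    """
    parts = []
    for prop in properties:
        cfg = prop.get("moduleConfig", {}).get(MODULE, {})
        if cfg.get("skip", False):
            continue  # explicitly excluded from the vector
        if prop["dataType"][0] not in TEXT_TYPES:
            continue  # non-text types are never vectorized
        value = obj.get(prop["name"])
        if value:
            parts.append(str(value))
    return " ".join(parts)


# Assumed property definitions and a made-up sample object.
properties = [
    {"name": "title", "dataType": ["text"]},
    {"name": "url", "dataType": ["string"],
     "moduleConfig": {MODULE: {"skip": True}}},
    {"name": "description", "dataType": ["text"]},
    {"name": "published_date", "dataType": ["date"]},
]
article = {
    "title": "Markets rally",
    "url": "https://example.com/a",
    "description": "Stocks rose on Monday.",
    "published_date": "2021-01-01T00:00:00Z",
}

combined = text_for_vectorization(article, properties)
print(combined)
```

Only one string per object reaches the vectorizer, so every included property shifts the resulting vector.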

Properties of any other data type (for example, the publish date and the references) are not string or text, so they are not used for vectorization.