jeffrey82221/JSkiner

JSkiner

JSkiner is a Python JSON schema inference engine with a Rust core. Its inference speed is about 10 times that of its pure-Python counterpart (jsonschema-inference).

Installation

pip install jskiner

Usage

Checking the JSON Schema of a Large .jsonl File

jskiner \
    --in <path_to_jsonl> \
    --verbose <false/true> \
    --out <output_file_path> \
    --nworkers <number_of_cpu_core> \
    --split <number_of_split_batch_size> \
    --split-path <path_to_store_the_split_files>
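When the command is launched from a script, it can help to assemble the argument list programmatically. The following is a minimal sketch that uses only the flags listed above; the helper name `build_jskiner_command`, the file names, and the idea that `--split` and `--split-path` may be omitted are assumptions made for illustration, not documented behavior.

```python
import shlex


def build_jskiner_command(in_path, out_path, nworkers=1, verbose=False,
                          split=None, split_path=None):
    """Assemble the jskiner argument list (hypothetical helper, not part of jskiner)."""
    cmd = [
        "jskiner",
        "--in", str(in_path),
        "--verbose", "true" if verbose else "false",
        "--out", str(out_path),
        "--nworkers", str(nworkers),
    ]
    # Assumption: the split options are only needed for large .jsonl inputs.
    if split is not None:
        cmd += ["--split", str(split)]
    if split_path is not None:
        cmd += ["--split-path", str(split_path)]
    return cmd


# Placeholder paths; replace them with real ones.
cmd = build_jskiner_command("data.jsonl", "schema.out", nworkers=4,
                            split=1000, split_path="splits")
print(shlex.join(cmd))

# To run it (requires jskiner to be installed):
# import subprocess
# subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems with paths that contain spaces.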

Checking the JSON Schema for a Folder of JSON Files

jskiner \
    --in <path_to_jsons> \
    --verbose <false/true> \
    --out <output_file_path> \
    --nworkers <number_of_cpu_core> \
    --batch-size <batch_size_for_inferencing> \
    --cuckoo-path <path_to_store_the_cuckoo_filter> \
    --cuckoo-size <approximate_size_of_the_cuckoo_filter (recommended: 10x the current JSON file count)> \
    --cuckoo-fpr <false_positive_rate_of_the_cuckoo_filter>
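The recommended value for `--cuckoo-size` (10 times the current JSON count) can be computed before invoking the command. This is a small sketch, assuming the JSON files sit directly in the input folder and end in `.json`; the helper name `recommended_cuckoo_size` is hypothetical, and how jskiner itself walks the folder is not stated here.

```python
import tempfile
from pathlib import Path


def recommended_cuckoo_size(json_dir, factor=10):
    """Return factor x the number of .json files directly inside json_dir."""
    n_files = sum(
        1 for p in Path(json_dir).iterdir()
        if p.is_file() and p.suffix == ".json"
    )
    return factor * n_files


# Demo on a throwaway folder holding three JSON files and one unrelated file.
with tempfile.TemporaryDirectory() as tmp:
    for i in range(3):
        (Path(tmp) / f"{i}.json").write_text("{}", encoding="utf-8")
    (Path(tmp) / "notes.txt").write_text("not json", encoding="utf-8")
    demo_size = recommended_cuckoo_size(tmp)

print(demo_size)  # 3 files x 10 = 30
```

The result can then be passed as the `--cuckoo-size` value.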

Inferring the Schema in Python

from jskiner import InferenceEngine
cpu_cnt = 16
engine = InferenceEngine(cpu_cnt)
json_string_list = ["1", "1.2", "null", "{\"a\": 1}"]
schema = engine.run(json_string_list)
schema

Union({Atomic(Float()), Atomic(Int()), Atomic(Non()), Record({"a": Atomic(Int())})})
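The example above passes a hand-written list of JSON strings to `engine.run`. To build such a list from a .jsonl file, something like the following sketch may be useful. The helper `load_json_strings` is hypothetical and not part of jskiner; it validates each line with the standard `json` module so that a malformed line is reported with its line number before it reaches the engine.

```python
import json
import tempfile
from pathlib import Path


def load_json_strings(jsonl_path):
    """Read a .jsonl file into a list of JSON strings, skipping blank lines."""
    strings = []
    with open(jsonl_path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue
            try:
                json.loads(line)  # validation only; the raw string is kept
            except json.JSONDecodeError as exc:
                raise ValueError(f"line {line_no} is not valid JSON: {exc}") from exc
            strings.append(line)
    return strings


# Demo on a temporary file with the same values as the example above.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "sample.jsonl"
    path.write_text('1\n1.2\n\nnull\n{"a": 1}\n', encoding="utf-8")
    json_string_list = load_json_strings(path)

print(json_string_list)

# The list can then be handed to the engine as shown above
# (requires jskiner to be installed):
# from jskiner import InferenceEngine
# schema = InferenceEngine(4).run(json_string_list)
```

For very large files the command-line interface with `--split` is probably the better fit, since this helper holds every line in memory.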

Calculating the Union of a List of Schemas

from jskiner import InferenceEngine
from jskiner.schema import Atomic, Int, Non
cpu_cnt = 16
engine = InferenceEngine(cpu_cnt)
schema = engine.run([Atomic(Int()), Atomic(Non())])
schema

Optional(Atomic(Int()))

Using the | Operator between Two Schemas

from jskiner import Atomic, Int, Non
schema = Atomic(Int()) | Atomic(Non())
schema

Optional(Atomic(Int()))

TODO:

  • Enable inference from a folder of JSON files
  • Enable ignoring existing JSON files using a cuckoo filter
  • Enable adding a starting schema file
  • Enable batch-by-batch processing of large .jsonl files
  • Fix: make sure repr escapes special characters
  • Auto-formatting using Black
  • Enable sampling of JSON files
  • Debug: show the input that causes a panic (alter panic str / alter reduce.py exception logging)
  • Fix: add a UnionRecord schema object
  • Enable direct inference from an online API (to avoid repeated downloads of JSON)
  • Enable regex to represent patterned FieldSet
