This is a Python JSON schema inference engine with a Rust core. Inference is about 10 times faster than its pure-Python counterpart (jsonschema-inference).
pip install jskiner
jskiner \
    --in <path_to_jsonl> \
    --verbose <false/true> \
    --out <output_file_path> \
    --nworkers <number_of_cpu_core> \
    --split <number_of_split_batch_size> \
    --split-path <path_to_store_the_split_files>
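To try the command above, a small JSONL file is enough. The following is a minimal sketch that writes one JSON document per line; the file name `sample.jsonl` and the helper `write_jsonl` are hypothetical and not part of jskiner.

```python
import json
from pathlib import Path


def write_jsonl(records, path):
    """Write each record as one JSON document per line (JSONL)."""
    path = Path(path)
    with path.open("w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")
    return path


# Hypothetical sample data mirroring the values used in the Python examples.
records = [1, 1.2, None, {"a": 1}]
out_path = write_jsonl(records, "sample.jsonl")
lines = out_path.read_text(encoding="utf-8").splitlines()
```

The resulting file can then be passed as the `--in` argument.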
jskiner \
    --in <path_to_jsons> \
    --verbose <false/true> \
    --out <output_file_path> \
    --nworkers <number_of_cpu_core> \
    --batch-size <batch_size_for_inferencing> \
    --cuckoo-path <path_to_store_the_cuckoo_filter> \
    --cuckoo-size <approximate_size_of_the_cuckoo_filter (recommended: 10x the current JSON count)> \
    --cuckoo-fpr <false_positive_rate_of_the_cuckoo_filter>
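The recommendation of sizing the cuckoo filter at 10 times the current JSON count can be computed before calling the CLI. This is only a sketch: it assumes the input is a flat folder of `*.json` files, and the helper name `suggested_cuckoo_size` is hypothetical.

```python
import tempfile
from pathlib import Path


def suggested_cuckoo_size(json_dir, factor=10):
    """Return `factor` times the number of *.json files directly inside json_dir."""
    json_count = sum(1 for _ in Path(json_dir).glob("*.json"))
    return factor * json_count


# Demo on a throwaway folder holding three empty JSON documents.
with tempfile.TemporaryDirectory() as tmp:
    for i in range(3):
        Path(tmp, f"doc_{i}.json").write_text("{}", encoding="utf-8")
    demo_size = suggested_cuckoo_size(tmp)  # 3 files -> 30
```

The returned number would be the value passed to `--cuckoo-size`.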
from jskiner import InferenceEngine
cpu_cnt = 16
engine = InferenceEngine(cpu_cnt)
json_string_list = ["1", "1.2", "null", "{\"a\": 1}"]
schema = engine.run(json_string_list)
schema
Union({Atomic(Float()), Atomic(Int()), Atomic(Non()), Record({"a": Atomic(Int())})})
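To make the result above easier to read, here is a pure-Python sketch of the idea: infer a type for each JSON string, then merge the distinct types into a union. It is not jskiner's implementation (which runs in Rust), it drops the `Atomic(...)` wrapper for brevity, and the names `Bool`, `Str`, `Array` and `Unknown` are placeholders that do not appear in this README.

```python
import json


def union(types):
    """Merge a set of type descriptions into one description."""
    types = sorted(types)
    if not types:
        return "Unknown"  # placeholder name for "no information"
    if len(types) == 1:
        return types[0]
    return "Union({" + ", ".join(types) + "})"


def infer(value):
    """Describe the shape of one parsed JSON value as a string."""
    if value is None:
        return "Non"
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return "Bool"
    if isinstance(value, int):
        return "Int"
    if isinstance(value, float):
        return "Float"
    if isinstance(value, str):
        return "Str"
    if isinstance(value, list):
        return "Array(" + union({infer(v) for v in value}) + ")"
    # Remaining case: a JSON object, parsed as a dict.
    fields = ", ".join(f'"{k}": {infer(v)}' for k, v in sorted(value.items()))
    return "Record({" + fields + "})"


def infer_all(json_strings):
    """Infer one description covering every JSON string in the list."""
    return union({infer(json.loads(s)) for s in json_strings})


sketch_schema = infer_all(["1", "1.2", "null", "{\"a\": 1}"])
```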
from jskiner import InferenceEngine
from jskiner.schema import Atomic, Int, Non
cpu_cnt = 16
engine = InferenceEngine(cpu_cnt)
schema = engine.run([Atomic(Int()), Atomic(Non())])
schema
Optional(Atomic(Int()))
from jskiner import Atomic, Int, Non
schema = Atomic(Int()) | Atomic(Non())
schema
Optional(Atomic(Int()))
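The `|` example can be mimicked in plain Python with `__or__`. The toy class below is only an illustration of the merge rule shown above (a type merged with `Non` becomes `Optional`); the class `ToySchema` is hypothetical and is not how jskiner's schema objects are implemented.

```python
class ToySchema:
    """Toy schema holding a set of atomic type names (not jskiner's classes)."""

    def __init__(self, *names):
        self.names = frozenset(names)

    def __or__(self, other):
        # Merging two schemas keeps every type that either side allows.
        return ToySchema(*(self.names | other.names))

    def __repr__(self):
        non_null = sorted(self.names - {"Non"})
        if not non_null:
            return "Non"
        if len(non_null) == 1:
            core = non_null[0]
        else:
            core = "Union({" + ", ".join(non_null) + "})"
        # A schema that also allows Non is rendered as Optional(...).
        return f"Optional({core})" if "Non" in self.names else core


merged = ToySchema("Int") | ToySchema("Non")
merged_repr = repr(merged)
```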
- Enable inference from a folder of json files
- Enable ignoring of existing json files using cuckoo filter
- Enable adding a starting schema file
- Enable batch-by-batch process on large jsonl file
- Fix: make sure repr escapes special characters.
- Auto Formatting Using Black
- Enable sampling of json files
- Debug: show the input that causes a panic. (alter panic str / alter reduce.py exception logging)
- Fix: add UnionRecord schema object
- Enable direct inference from an online API. (avoids repeated downloads of JSON)
- Enable Regex to represent patterned FieldSet