RecordIO

RecordIO is a file format created for PaddlePaddle Elastic Deep Learning. It is generally useful for distributed computing.

Motivations

In distributed computing, we often need to dispatch tasks to worker processes. Usually, a task is defined as a partition of the input data, as in MapReduce and distributed machine learning.

Most distributed filesystems, including HDFS, Google FS, and CephFS, prefer a small number of big files. Therefore, it is impractical to create each task as a small file; instead, we need a format for big files that is

appendable, so that applications can append records to the file without updating the metadata, which makes writing fault-tolerant, and
partitionable, so that applications can quickly scan the file to count its records and create tasks that each correspond to a sequence of records.

RecordIO is such a file format.
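
To make the two properties concrete, here is a minimal sketch of a toy length-prefixed record file. It is an illustration only and is not RecordIO's actual on-disk layout; the 4-byte header and the helper names `append_record`, `scan_offsets`, and `read_record` are all hypothetical. Appending only writes at the end of the file, and counting records only requires hopping from header to header.

```python
import os
import struct
import tempfile

# Hypothetical header: a 4-byte little-endian payload length.
# This is a toy layout for illustration, not the real RecordIO format.
HEADER = struct.Struct("<I")


def append_record(path, payload):
    """Append one record; bytes already in the file are never rewritten."""
    with open(path, "ab") as f:
        f.write(HEADER.pack(len(payload)))
        f.write(payload)


def scan_offsets(path):
    """Return the byte offset of every complete record, reading headers only."""
    offsets = []
    size = os.path.getsize(path)
    with open(path, "rb") as f:
        pos = 0
        while pos + HEADER.size <= size:
            f.seek(pos)
            (length,) = HEADER.unpack(f.read(HEADER.size))
            if pos + HEADER.size + length > size:
                break  # incomplete trailing record, e.g. the writer crashed
            offsets.append(pos)
            pos += HEADER.size + length
    return offsets


def read_record(path, offset):
    """Read the payload of the record that starts at the given offset."""
    with open(path, "rb") as f:
        f.seek(offset)
        (length,) = HEADER.unpack(f.read(HEADER.size))
        return f.read(length)


demo_path = os.path.join(tempfile.mkdtemp(), "toy.records")
for payload in (b"abc", b"def", b"ghij"):
    append_record(demo_path, payload)

offsets = scan_offsets(demo_path)
records = [read_record(demo_path, o) for o in offsets]
print(len(offsets), records)
```

Because a reader in this sketch skips an incomplete trailing record, a writer that crashes mid-append leaves the earlier records readable, which is the sense in which an append-only layout helps fault tolerance.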

Write

import recordio

# Open the file for writing and append two records.
with recordio.File('demo.recordio', 'w') as rdio_w:
    rdio_w.write('abc')
    rdio_w.write('def')

Read

import recordio

with recordio.File('demo.recordio', 'r') as rdio_r:
    # Random access
    for i in range(rdio_r.count()):
        print(rdio_r.get(i))

    # Range reading
    for record in rdio_r.get_reader(2, 10):
        print(record)

    # Direct iteration
    for record in rdio_r:
        print(record)
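
As a sketch of how the record count can be turned into tasks, the helper below splits a total number of records into contiguous index ranges. The function `split_into_tasks` and its parameters are hypothetical and are not part of the recordio package. In practice the total would come from a call such as `rdio_r.count()`, and each range could then be handed to a worker for range reading; whether `get_reader` treats its second argument as an exclusive end index is an assumption that this README does not confirm.

```python
def split_into_tasks(total_records, records_per_task):
    """Return (start, end) index pairs, end exclusive, covering all records."""
    if records_per_task <= 0:
        raise ValueError("records_per_task must be positive")
    return [
        (start, min(start + records_per_task, total_records))
        for start in range(0, total_records, records_per_task)
    ]


# Example: 10 records split into tasks of at most 4 records each.
tasks = split_into_tasks(total_records=10, records_per_task=4)
print(tasks)  # [(0, 4), (4, 8), (8, 10)]
```

Keeping tasks as index ranges means the dispatcher only needs the record count, not the record contents, to plan the work.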

Unittest

In this directory:

python -m unittest recordio/recordio/*_test.py

Packaging

The packaging process largely follows the TensorFlow custom op example.

First, build the RecordIO development Docker image:

docker build -t recordio:dev -f Dockerfile .

Start a Docker container, mounting the git and Bazel .cache directories:

docker run --rm -it \
    -v $HOME/git:/git \
    -v $HOME/.cache:/.cache \
    -w /git/pyrecordio \
    recordio:dev

Inside the container, build the pip package:

bazel build build_pip_pkg 
bazel-bin/build_pip_pkg artifacts

After building the package, force-install it to replace the existing version:

pip install -I artifacts/recordio-<version>.whl

To test the installed TensorFlow RecordIO Dataset op:

python recordio/tensorflow_op/python/tf_recordio_dataset_test.py

There is also a prepackaged version checked into the repo for convenience.
