Reading and writing gzip files, per RFC 1952.
- Simplicity -- we've often found the batteries-included Python gzip implementation to be more trouble than it's worth
- Completeness -- most of the header information in the gzip container doesn't seem to be used very widely. All the same, this aims to make those fields accessible
- Performance (eventually, if necessary) -- if at some point we feel it necessary, we may promote this to a C++ extension
- Multiple members -- while the gzip container format describes how to add multiple members to an archive, this functionality is not included
- Incremental reads -- currently three modes are supported: 1) incrementally reading chunks of data limited to a predetermined size, 2) reading the lines of a file incrementally, or 3) reading the whole file in at once. The read(<size>) call, specifically, is not yet supported.
- Extra flags -- the FEXTRA field describes a list of (subfield ID, field) pairs, and while it's made accessible (for both reading and writing), this does not include any utilities to easily format it.
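Since no FEXTRA formatting utilities are included, here is a minimal sketch of what such helpers could look like, based on the subfield layout in RFC 1952 (two ID bytes, a 2-byte little-endian length, then the data). The names pack_extra and unpack_extra are hypothetical and not part of gzippy, and whether gzippy expects the field with or without its leading XLEN prefix is not stated here; this sketch covers only the subfield bytes themselves.

```python
import struct


def pack_extra(subfields):
    """Pack (subfield ID, data) pairs into the subfield bytes of a FEXTRA field.

    Hypothetical helper, not part of gzippy. Each subfield is laid out as
    SI1, SI2, a 2-byte little-endian length, and then the data (RFC 1952).
    """
    out = bytearray()
    for subfield_id, data in subfields:
        if len(subfield_id) != 2:
            raise ValueError('subfield IDs are exactly two bytes')
        # struct.pack raises struct.error if the data is longer than 65535 bytes
        out += subfield_id + struct.pack('<H', len(data)) + data
    return bytes(out)


def unpack_extra(extra):
    """Split FEXTRA subfield bytes back into (subfield ID, data) pairs."""
    subfields = []
    offset = 0
    while offset < len(extra):
        if offset + 4 > len(extra):
            raise ValueError('truncated subfield header')
        subfield_id = extra[offset:offset + 2]
        (length,) = struct.unpack_from('<H', extra, offset + 2)
        start = offset + 4
        if start + length > len(extra):
            raise ValueError('truncated subfield data')
        subfields.append((subfield_id, extra[start:start + length]))
        offset = start + length
    return subfields


# Usage: round-trip two subfields, one of them empty
pairs = [(b'Ap', b'\x01\x02'), (b'XY', b'')]
packed = pack_extra(pairs)
unpacked = unpack_extra(packed)
```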
gzippy is available on PyPI:
pip install gzippy
At the top level, gzippy provides an open function much like gzip's:
import gzippy

# The file is opened in binary mode, so write bytes
with gzippy.open('example.gz', 'wb') as fout:
    fout.write(b'Some content.')

with gzippy.open('example.gz', 'rb') as fin:
    print(fin.read())
Data can be read all at once with a call to read(), or incrementally with a call to the chunks generator (which reads chunks of a predetermined size from the compressed file and yields decompressed blocks), or the lines generator (which yields lines):
# Read it in all at once
with gzippy.open('smallish-archive.gz') as fin:
    data = fin.read()

# Read it in manageable chunks (on the order of 4KB)
with gzippy.open('really-big-archive.gz') as fin:
    for chunk in fin.chunks():
        ...

# Read it in line-by-line
with gzippy.open('really-big-archive.gz') as fin:
    # Alternatively, fin.lines()
    for line in fin:
        ...
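To illustrate the idea behind chunked and line-by-line reading, here is a minimal sketch using only the standard library's zlib module. It is not gzippy's implementation; the function names iter_decompressed and iter_lines are hypothetical, and like gzippy it assumes a single-member archive.

```python
import gzip
import io
import zlib


def iter_decompressed(fobj, chunk_size=4096):
    """Read chunk_size compressed bytes at a time and yield decompressed blocks."""
    # Adding 16 to the window size tells zlib to expect a gzip header and trailer
    decompressor = zlib.decompressobj(16 + zlib.MAX_WBITS)
    while True:
        chunk = fobj.read(chunk_size)
        if not chunk:
            break
        data = decompressor.decompress(chunk)
        if data:
            yield data
    tail = decompressor.flush()
    if tail:
        yield tail


def iter_lines(blocks):
    """Yield lines (newline included) from an iterable of byte blocks."""
    pending = b''
    for block in blocks:
        pending += block
        lines = pending.split(b'\n')
        # The last piece has no newline yet, so hold it for the next block
        pending = lines.pop()
        for line in lines:
            yield line + b'\n'
    if pending:
        yield pending


# Usage: compress something in memory and read it back in tiny chunks
payload = b'first\nsecond\nthird'
compressed = gzip.compress(payload)
blocks = list(iter_decompressed(io.BytesIO(compressed), chunk_size=5))
lines = list(iter_lines(iter_decompressed(io.BytesIO(compressed), chunk_size=5)))
```

Memory use stays bounded by the chunk size and the longest line, which is the main reason to prefer these modes for large archives.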
When writing a file, additional headers may be provided using the GzipWriter class directly:
with gzippy.GzipWriter.open('example.gz', name='example', comment='comment') as fout:
    ...
Similarly, these headers are available upon reading a file:
with gzippy.open('example.gz') as fin:
    print(fin.headers)
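For a sense of where these header values live in the file, the sketch below parses the fixed and optional header fields of a gzip member as laid out in RFC 1952. It is an illustration only: parse_gzip_header and the dictionary keys it returns are hypothetical, and the actual shape of fin.headers in gzippy may differ.

```python
import gzip
import io
import struct

# Flag bits from RFC 1952 for the optional fields handled here
FEXTRA, FNAME, FCOMMENT = 0x04, 0x08, 0x10


def parse_gzip_header(data):
    """Parse the header of the first gzip member in data into a dict."""
    # Fixed 10-byte header: magic, method, flags, mtime, extra flags, OS
    magic, method, flags, mtime, _xfl, _os = struct.unpack_from('<HBBIBB', data, 0)
    if magic != 0x8b1f:
        raise ValueError('not a gzip file')
    headers = {'method': method, 'mtime': mtime,
               'extra': None, 'name': None, 'comment': None}
    offset = 10
    if flags & FEXTRA:
        # 2-byte little-endian length followed by the raw subfield bytes
        (xlen,) = struct.unpack_from('<H', data, offset)
        headers['extra'] = data[offset + 2:offset + 2 + xlen]
        offset += 2 + xlen
    # Name and comment are zero-terminated Latin-1 strings, in that order
    for key, flag in (('name', FNAME), ('comment', FCOMMENT)):
        if flags & flag:
            end = data.index(b'\x00', offset)
            headers[key] = data[offset:end].decode('latin-1')
            offset = end + 1
    return headers


# Usage: write an archive in memory with the standard library, then parse it
buf = io.BytesIO()
with gzip.GzipFile(filename='example', mode='wb', fileobj=buf, mtime=0) as fout:
    fout.write(b'Some content.')
headers = parse_gzip_header(buf.getvalue())
```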
It's recommended that you use a virtualenv to develop gzippy:
virtualenv venv
source venv/bin/activate
pip install -r requirements.txt
Tests are run with:
make test
These are not all hard-and-fast rules, but in general PRs have the following expectations:
- pass Travis -- or more generally, whatever CI is used for the particular project
- be a complete unit -- whether a bug fix or feature, it should appear as a complete unit before consideration.
- maintain code coverage -- some projects may include code coverage requirements as part of the build as well
- maintain the established style -- this means the existing style of established projects, the established conventions of the team for a given language on new projects, and the guidelines of the community of the relevant languages and frameworks.
- include failing tests -- in the case of bugs, failing tests demonstrating the bug should be included as one commit, followed by a commit making the test succeed. This allows us to jump to a world with a bug included, and prove that our test in fact exercises the bug.
- be reviewed by one or more developers -- not all feedback has to be accepted, but it should all be considered.
- avoid 'addressed PR feedback' commits -- in general, PR feedback should be rebased back into the appropriate commits that introduced the change. In cases where this is burdensome, PR feedback commits may be used, but they should still describe the changes contained therein.
PR reviews consider the design, organization, and functionality of the submitted code.
Certain types of changes should be made in their own commits to improve readability. When too many different types of changes happen simultaneously in a single commit, the purpose of each change is muddled. By giving each commit a single logical purpose, it is implicitly clear why changes in that commit took place.
- updating / upgrading dependencies -- this is especially true for invocations like bundle update or berks update.
- introducing a new dependency -- often preceded by a commit updating existing dependencies, this should only include the changes for the new dependency.
- refactoring -- these commits should preserve all the existing functionality and merely update how it's done.
- utility components to be used by a new feature -- if introducing an auxiliary class in support of a subsequent commit, add this new class (and its tests) in its own commit.
- config changes -- when adjusting configuration in isolation
- formatting / whitespace commits -- when adjusting code only for stylistic purposes.
Small new features (where small refers to the size and complexity of the change, not the impact) are often introduced in a single commit. Larger features or components might be built up piecewise, with each commit containing a single part of it (and its corresponding tests).
In general, bug fixes should come in two-commit pairs: a commit adding a failing test demonstrating the bug, and a commit making that failing test pass.
Whenever the version included in setup.py is changed (and it should be changed when appropriate, following http://semver.org/), a corresponding tag should be created with the same version number (formatted v<version>).
git tag -a v0.1.0 -m 'Version 0.1.0

This release contains an initial working version of the `crawl` and `parse`
utilities.'
# A plain `git push origin` does not push tags by default, so name the tag
git push origin v0.1.0
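As a rough sketch of the tag convention, the helper below derives the expected tag name from a version string and rejects anything that is not a plain MAJOR.MINOR.PATCH version. The function names are hypothetical, and the pattern deliberately ignores the pre-release and build-metadata suffixes that semver also permits.

```python
import re

# Plain MAJOR.MINOR.PATCH without leading zeros; semver suffixes are not handled
SEMVER = re.compile(r'^(0|[1-9]\d*)\.(0|[1-9]\d*)\.(0|[1-9]\d*)$')


def tag_for_version(version):
    """Return the tag name (v<version>) for a MAJOR.MINOR.PATCH version."""
    if not SEMVER.match(version):
        raise ValueError('not a MAJOR.MINOR.PATCH version: %r' % (version,))
    return 'v' + version


def tag_matches(tag, version):
    """Check that an existing tag name corresponds to the given version."""
    return tag == tag_for_version(version)


# Usage: the version from setup.py and the tag created for it
expected_tag = tag_for_version('0.1.0')
```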