This repository was archived by the owner on Dec 16, 2022. It is now read-only.

Accept compressed files as input to predict when using a Predictor #5237

@danieldeutsch

Description

Is your feature request related to a problem? Please describe.
I typically used compressed datasets (e.g. gzipped) to save disk space. This works fine with AllenNLP during training because I can write my dataset reader to load the compressed data. However, the predict command opens the file and reads lines for the Predictor. This fails when it tries to load data from my compressed files.

def _get_json_data(self) -> Iterator[JsonDict]:
    if self._input_file == "-":
        for line in sys.stdin:
            if not line.isspace():
                yield self._predictor.load_line(line)
    else:
        input_file = cached_path(self._input_file)
        with open(input_file, "r") as file_input:
            for line in file_input:
                if not line.isspace():
                    yield self._predictor.load_line(line)

Describe the solution you'd like
Either automatically detect that the file is compressed, or add a flag to predict that indicates how the file is compressed. One method that I have used to detect whether a file is gzipped is here, although it isn't 100% accurate. I have an implementation here. Otherwise, a flag like --compression-type to mark how the file is compressed should be sufficient. Passing the compression type would allow support for gzip, bz2, or any other method.
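For reference, the auto-detection approach could be sketched roughly like this: peek at the file's leading magic bytes and pick a matching opener, falling back to a plain text open. This is a hypothetical helper (open_maybe_compressed is not an AllenNLP function), and like any magic-byte check it is a heuristic rather than a guarantee:

```python
import bz2
import gzip
from typing import IO

# Magic-byte prefixes mapped to stdlib openers that can read the format
# as text. Extendable to other methods (e.g. lzma) in the same way.
_MAGIC_OPENERS = [
    (b"\x1f\x8b", gzip.open),  # gzip
    (b"BZh", bz2.open),        # bz2
]


def open_maybe_compressed(path: str) -> IO[str]:
    """Open path as text, transparently decompressing gzip/bz2 files."""
    with open(path, "rb") as f:
        prefix = f.read(3)
    for magic, opener in _MAGIC_OPENERS:
        if prefix.startswith(magic):
            return opener(path, "rt")
    return open(path, "r")
```

The --compression-type flag variant would be even simpler: the same table keyed by the flag's value instead of by magic bytes, with no detection step at all.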
