Is your feature request related to a problem? Please describe.
I typically use compressed datasets (e.g. gzipped) to save disk space. This works fine with AllenNLP during training because I can write my dataset reader to load the compressed data. However, the `predict` command opens the input file itself with a plain `open` call and reads lines for the `Predictor`, so it fails when it tries to load data from my compressed files.
```python
def _get_json_data(self) -> Iterator[JsonDict]:
    if self._input_file == "-":
        for line in sys.stdin:
            if not line.isspace():
                yield self._predictor.load_line(line)
    else:
        input_file = cached_path(self._input_file)
        with open(input_file, "r") as file_input:
            for line in file_input:
                if not line.isspace():
                    yield self._predictor.load_line(line)
```
Describe the solution you'd like
Either automatically detect that the file is compressed, or add a flag to `predict` that indicates it is. One method that I have used to detect if a file is gzipped is here, although it isn't 100% accurate. I have an implementation here. Otherwise, a flag like `--compression-type` that states how the file is compressed should be sufficient. Passing the type of compression would allow support for gzip, bz2, or any other method.
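To make the two options concrete, here is a rough sketch of how they could fit together. It is not AllenNLP code: `detect_compression`, `open_maybe_compressed`, and `read_nonblank_lines` are hypothetical names, and the `compression_type` argument only stands in for the proposed `--compression-type` flag. Detection checks the leading magic bytes of the file, which is a heuristic and can misclassify an uncompressed file that happens to start with the same bytes.

```python
import bz2
import gzip
from typing import Iterator, Optional, TextIO

# Leading "magic" bytes that identify each supported format.
_MAGIC_BYTES = {
    "gzip": b"\x1f\x8b",
    "bz2": b"BZh",
}

# Standard-library openers for each format (hypothetical registry).
_OPENERS = {
    "gzip": gzip.open,
    "bz2": bz2.open,
}


def detect_compression(path: str) -> Optional[str]:
    """Guess the compression type from the first bytes of the file.

    Returns None when no known magic bytes are found.
    """
    with open(path, "rb") as raw:
        head = raw.read(3)
    for name, magic in _MAGIC_BYTES.items():
        if head.startswith(magic):
            return name
    return None


def open_maybe_compressed(path: str, compression_type: Optional[str] = None) -> TextIO:
    """Open `path` in text mode, decompressing on the fly if needed.

    `compression_type` stands in for the proposed --compression-type flag.
    When it is None, fall back to magic-byte detection.
    """
    if compression_type is None:
        compression_type = detect_compression(path)
    if compression_type is None:
        return open(path, "r")
    if compression_type not in _OPENERS:
        raise ValueError(f"Unsupported compression type: {compression_type}")
    return _OPENERS[compression_type](path, "rt")


def read_nonblank_lines(path: str, compression_type: Optional[str] = None) -> Iterator[str]:
    """Yield non-blank lines, mirroring the loop in `_get_json_data`."""
    with open_maybe_compressed(path, compression_type) as file_input:
        for line in file_input:
            if not line.isspace():
                yield line
```

With something like this, the `else` branch of `_get_json_data` could iterate over `read_nonblank_lines(input_file, compression_type)` and pass each line to `self._predictor.load_line`. An explicit type always takes precedence over detection, so the flag remains a reliable override when the heuristic guesses wrong.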
The code quoted above is from allennlp/allennlp/commands/predict.py, lines 208 to 218 in 39d7e5a.