Skip to main content Link Search Menu Expand Document (external link)

Data Compression

The GT3X RAW files are large in size, and transmitting them through disk or network could take a lot of time and substantial storage space. To mitigate these issues, we recommend compressing these files using gzip as the files are highly compressible (we reduced 800GB of dataset to 40GB). Then use the --gzipped toggle of pre_process_data.py to consume the gzipped documents directly. Note: only the RAW files can be passed as gzipped, each RAW file must be compressed individually, and the extension must be .csv.gz.