Download/Upload Datasets with Kaggle API

Background

Kaggle is well known for hosting a large number of datasets and competitions. For machine learning and deep learning, it is one of the most effective places to search for datasets. When you are stuck looking for a specific dataset, you may find that it already exists on Kaggle. However, downloading a dataset from the Kaggle website can be unreliable, depending on your network environment; moreover, the website only lets you download to your local machine. If you process data on a server, you need an extra step to transfer the dataset there. In my experience, uploading datasets to Kaggle through the website can also be a real pain, especially when they are large, because it too depends heavily on your network conditions.

The Kaggle CLI provides a shortcut for accessing datasets online. You can download the dataset you want directly to your server at a stable transfer speed, and upload local datasets just as easily. This post covers setting up the Kaggle API on your local device and downloading/uploading datasets from the command line.

Installation

Easy install with pip.

pip install kaggle

If you encounter any errors during installation, refer to the installation issues documented here.
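A common symptom after installing is that the shell cannot find the kaggle executable. The sketch below is one way to check for that; check_kaggle_cli is a hypothetical helper name used only for illustration, and the ~/.local/bin hint assumes a per-user pip install on Linux.

```shell
# Hypothetical helper (not part of the Kaggle CLI): report whether the
# kaggle command is reachable from the current shell.
check_kaggle_cli() {
    if command -v kaggle >/dev/null 2>&1; then
        # Print the installed version to confirm the CLI actually runs.
        kaggle --version
    else
        # Assumption: for a per-user install on Linux, pip usually places
        # executables in ~/.local/bin, which may be missing from PATH.
        echo "kaggle not found on PATH; check where pip placed the executable (for example ~/.local/bin)" >&2
        return 1
    fi
}

# Usage:
# check_kaggle_cli
```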

Setup

Before you play with Kaggle in the command line, you need to convince Kaggle that your machine is authorized to access its data. First things first: obtain an API token from Kaggle. Go to the Account tab of your user profile and select Create API Token. This triggers the download of kaggle.json, a file containing your API credentials. Place this file at ~/.kaggle/kaggle.json on macOS and Linux, or at C:\Users\<Windows-username>\.kaggle\kaggle.json on Windows.
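On macOS and Linux, placing the token can be sketched as below. install_kaggle_token is a hypothetical helper name, and the ~/Downloads path in the usage line is an assumption about where your browser saved the file. The chmod 600 step restricts the file to your own user, which matters because the file holds your API key.

```shell
# Hypothetical helper (not part of the Kaggle CLI): copy a downloaded
# kaggle.json into ~/.kaggle and make it readable by the owner only.
install_kaggle_token() {
    src="$1"                      # path to the downloaded kaggle.json
    dest_dir="$HOME/.kaggle"

    if [ ! -f "$src" ]; then
        echo "token file not found: $src" >&2
        return 1
    fi

    mkdir -p "$dest_dir"
    cp "$src" "$dest_dir/kaggle.json"
    # Owner read/write only, so other users cannot read the API key.
    chmod 600 "$dest_dir/kaggle.json"
}

# Usage (assumes the token was saved to ~/Downloads):
# install_kaggle_token ~/Downloads/kaggle.json
```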

Download Datasets

Once you have configured the Kaggle CLI, you can download any dataset with a single command.
Say you want to download an X-ray image dataset about pneumonia. On the dataset page, click Copy API Command, and you will get the command needed to download the dataset.

kaggle datasets download -d paultimothymooney/chest-xray-pneumonia

Figure: X-ray Image Dataset (Pneumonia)

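If you run the command exactly as copied, you may find that the dataset ends up in the .kaggle directory, which in most cases is not the intended destination. (The default location depends on the CLI version; the help text below reports the current working directory as the default.) Take a look at the API help information.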

kaggle datasets download -h

The help information looks like

usage: kaggle datasets download [-h] [-f FILE_NAME] [-p PATH] [-w] [--unzip] [-o] [-q] [dataset]

optional arguments:
  -h, --help            show this help message and exit
  dataset               Dataset URL suffix in format <owner>/<dataset-name> (use "kaggle datasets list" to show options)
  -f FILE_NAME, --file FILE_NAME
                        File name, all files downloaded if not provided
                        (use "kaggle datasets files -d <dataset>" to show options)
  -p PATH, --path PATH  Folder where file(s) will be downloaded, defaults to current working directory
  -w, --wp              Download files to current working path
  --unzip               Unzip the downloaded file. Will delete the zip file when completed.
  -o, --force           Skip check whether local version of file is up to date, force file download
  -q, --quiet           Suppress printing information about the upload/download progress

You can set the download destination to any directory with the -p flag, or to the current working directory with the -w flag. The download arrives as a compressed file, so adding the --unzip flag saves you a manual extraction step.

A sample command looks like

kaggle datasets download -d paultimothymooney/chest-xray-pneumonia -w --unzip

The dataset will then be downloaded like this:

Figure: Dataset downloading

After the download completes, you will find the dataset directory chest_xray already unzipped.
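If you would rather keep datasets in a dedicated folder than in the working directory, the -p and --unzip flags from the help text can be wrapped in a small function. This is a minimal sketch: fetch_dataset is a hypothetical helper name, and ./data is an arbitrary default destination chosen for illustration.

```shell
# Hypothetical helper (not part of the Kaggle CLI): download a dataset
# into a chosen folder and unzip it there.
fetch_dataset() {
    dataset="$1"            # <owner>/<dataset-name>
    dest="${2:-./data}"     # destination folder; ./data is an assumed default

    # Create the destination folder up front so the path is known to exist.
    mkdir -p "$dest"
    kaggle datasets download -d "$dataset" -p "$dest" --unzip
}

# Usage:
# fetch_dataset paultimothymooney/chest-xray-pneumonia ./data
```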

Upload Datasets

Sometimes we also want to upload datasets for future use. Here’s what we can do. Suppose we create a new dataset folder, random_dataset, containing a single file, random.jpg.

First, we need to initialize a metadata file. From the parent directory of random_dataset, run:

kaggle datasets init -p random_dataset

Revise the dataset title and dataset id in the generated dataset-metadata.json. It is suggested that the title and the id do not contain underscores.
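Instead of editing the generated file by hand, the metadata can also be written directly. The sketch below assumes the template contains the fields title, id, and licenses, with id in the form <username>/<slug>; compare it with the file that init generates for your CLI version. write_dataset_metadata is a hypothetical helper name, and CC0-1.0 is only an example license.

```shell
# Hypothetical helper (not part of the Kaggle CLI): write a minimal
# dataset-metadata.json into a dataset folder.
# Assumption: title and id contain no double quotes or backslashes,
# since the values are pasted into the JSON without escaping.
write_dataset_metadata() {
    dir="$1"      # dataset folder
    title="$2"    # human-readable dataset title
    id="$3"       # <username>/<slug>

    mkdir -p "$dir"
    cat > "$dir/dataset-metadata.json" <<EOF
{
  "title": "$title",
  "id": "$id",
  "licenses": [{"name": "CC0-1.0"}]
}
EOF
}

# Usage (replace <your-username> with your Kaggle username):
# write_dataset_metadata random_dataset "Random Dataset" "<your-username>/random-dataset"
```

The usage line follows the suggestion above by using a hyphen rather than an underscore in the id.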

Then, we can upload the dataset:

kaggle datasets create -p random_dataset

You can also add the -u flag to make the dataset public when it is created. Running the command triggers the upload.

(base) ➜  kaggle_cli kaggle datasets create -p randomdataset
Starting upload for file random.jpg
100%|████████████████████████████████████████████████████████████████████████████████| 65.4k/65.4k [00:02<00:00, 30.7kB/s]
Upload successful: random.jpg (65KB)

Downloading datasets works fine from an IP address in China. Uploading, however, requires an IP address outside China.

Author: Ziang Zhou
Posted on: 2022-04-29
Updated on: 2022-08-30