Download/Upload Datasets with Kaggle API
Background
Kaggle is well known for hosting a large number of datasets and competitions. For machine learning and deep learning, it is one of the most effective places to search for dataset resources. When you are struggling to obtain a specific dataset, you may find that it already exists on Kaggle. However, downloading a dataset from the Kaggle site can be unreliable, depending on your network environment; moreover, the website can only download datasets to your local machine. If you are running experiments on a server, you need an extra step to transfer the dataset. In my experience, uploading datasets to Kaggle through the website can also be a real pain, especially when the dataset is large, since it depends heavily on your network conditions and is prone to failure.
The Kaggle CLI provides a shortcut for accessing datasets online. You can download the dataset you want directly to your server with a stable transmission speed, and upload local datasets just as easily. This post covers setting up the Kaggle API on your local device and downloading/uploading datasets from the command line.
Installation
Install the package with pip:
pip install kaggle
If you encounter any errors, refer to the installation issues here.
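After installing, it can help to confirm that the `kaggle` executable is actually reachable from your shell. The snippet below is a minimal sketch: the function name `kaggle_available` is made up for illustration, and the hint about `~/.local/bin` assumes a Linux setup where pip installed into the user directory.

```shell
# kaggle_available: hypothetical helper (not part of the Kaggle CLI).
# Succeeds when a `kaggle` command can be found by the current shell.
kaggle_available() {
  if command -v kaggle >/dev/null 2>&1; then
    return 0
  fi
  # Assumption: with `pip install --user` on Linux, scripts usually land in ~/.local/bin.
  echo "kaggle not found on PATH; check that pip's script directory (e.g. ~/.local/bin) is on PATH" >&2
  return 1
}
```

A typical use would be `kaggle_available && kaggle --version`, which prints the installed version only when the command is found.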
Setup
Before you play with Kaggle from the command line, you need to convince Kaggle that your machine is authorized to access its data. First, obtain an API token from Kaggle: go to the Account tab of your user profile and select Create API Token. This triggers the download of kaggle.json, a file containing your API credentials. On macOS and Linux, place this file at ~/.kaggle/kaggle.json; on Windows, place it at C:\Users\<Windows-username>\.kaggle\kaggle.json.
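On macOS and Linux, placing the token can be scripted. The sketch below copies a downloaded kaggle.json into the config directory and restricts it to the owner, since, as far as I know, the CLI warns when the key file is readable by other users. The function name `install_kaggle_token` and the example source path are assumptions for illustration.

```shell
# install_kaggle_token SRC [DEST_DIR]: hypothetical helper (not part of the Kaggle CLI).
# Copies the downloaded token into DEST_DIR (default: ~/.kaggle) as kaggle.json.
install_kaggle_token() {
  token_src="$1"
  token_dir="${2:-$HOME/.kaggle}"
  if [ ! -f "$token_src" ]; then
    echo "token file not found: $token_src" >&2
    return 1
  fi
  mkdir -p "$token_dir"
  cp "$token_src" "$token_dir/kaggle.json"
  # Owner read/write only, so other users on the machine cannot read the API key.
  chmod 600 "$token_dir/kaggle.json"
}
```

For example, `install_kaggle_token ~/Downloads/kaggle.json` would work if your browser saved the file to ~/Downloads; adjust the path to wherever the token actually landed.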
Download Datasets
Once you have configured the Kaggle CLI, you are ready to download any dataset with a single command.
Say you want to download an X-ray image dataset about pneumonia. Click Copy API Command on the dataset page, and you will get the essential information to download the dataset.
kaggle datasets download -d paultimothymooney/chest-xray-pneumonia
If you enter the command you just copied, you will find the dataset ends up in the .kaggle directory, which in most cases is not the intended destination. Take a look at the API help information.
kaggle datasets download -h
The help information looks like this:
usage: kaggle datasets download [-h] [-f FILE_NAME] [-p PATH] [-w] [--unzip] [-o] [-q] [dataset]
You can set the download destination to any directory with the -p flag, or to the current working directory with the -w flag. The downloaded object is a compressed file, so you can add the --unzip flag to save some time.
A sample command looks like this:
kaggle datasets download -d paultimothymooney/chest-xray-pneumonia -w --unzip
The dataset will then be downloaded. After the download completes, you will find the dataset directory chest_xray already unzipped.
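If you download datasets to a server regularly, the flags described above can be wrapped in a small shell function. This is only a sketch: `download_dataset` is a made-up name, and it relies solely on the -d, -p and --unzip options shown in the commands and help text above.

```shell
# download_dataset DATASET [DEST_DIR]: hypothetical wrapper around the CLI call.
# DATASET has the form <owner>/<dataset-name>; DEST_DIR defaults to the current directory.
download_dataset() {
  dl_dataset="$1"
  dl_dest="${2:-.}"
  if [ -z "$dl_dataset" ]; then
    echo "usage: download_dataset <owner>/<dataset-name> [dest_dir]" >&2
    return 1
  fi
  mkdir -p "$dl_dest"
  # -p sets the destination directory; --unzip extracts the archive after download.
  kaggle datasets download -d "$dl_dataset" -p "$dl_dest" --unzip
}
```

For instance, `download_dataset paultimothymooney/chest-xray-pneumonia ./data` would place the extracted files under ./data instead of the default location.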
Upload Datasets
Sometimes we also want to upload datasets for future use. Here’s what we can do. Suppose we create a new dataset folder, random_dataset, containing a single file, random.jpg.
First, we need to initialize a metadata file.
cd random_dataset
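For reference, the metadata template is typically produced by the CLI's `datasets init` subcommand, which writes a dataset-metadata.json into the folder given by -p. The wrapper below is a sketch under that assumption; `init_dataset_folder` is a made-up name, not something the CLI provides.

```shell
# init_dataset_folder [DIR]: hypothetical wrapper (DIR defaults to the current directory).
# Assumption: `kaggle datasets init -p DIR` writes a template dataset-metadata.json into DIR.
init_dataset_folder() {
  init_dir="${1:-.}"
  if [ ! -d "$init_dir" ]; then
    echo "not a directory: $init_dir" >&2
    return 1
  fi
  kaggle datasets init -p "$init_dir"
}
```

From inside random_dataset, `init_dataset_folder .` would be the equivalent call; from its parent directory, `init_dataset_folder random_dataset`.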
Revise the dataset title and dataset id in the generated dataset-metadata.json. It is suggested that the title and the id do not contain underscores.
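If you would rather script this edit than open the file by hand, the sketch below writes a metadata file with the title and id already filled in. The field layout (title, id in the form <username>/<slug>, licenses) is my understanding of what the generated template contains; the function name, the CC0-1.0 license choice, and the underscore check are illustrative assumptions.

```shell
# write_dataset_metadata DIR USERNAME SLUG TITLE: hypothetical helper.
# Writes DIR/dataset-metadata.json; refuses underscores in SLUG or TITLE,
# following the suggestion above. TITLE is inserted as-is, so it must not
# contain double quotes or backslashes.
write_dataset_metadata() {
  md_dir="$1"
  md_user="$2"
  md_slug="$3"
  md_title="$4"
  case "$md_slug$md_title" in
    *_*)
      echo "title and id should not contain underscores" >&2
      return 1
      ;;
  esac
  # Assumed layout of the metadata file; the license value is only an example.
  cat > "$md_dir/dataset-metadata.json" <<EOF
{
  "title": "$md_title",
  "id": "$md_user/$md_slug",
  "licenses": [{"name": "CC0-1.0"}]
}
EOF
}
```

For example, `write_dataset_metadata random_dataset <your-username> random-dataset "Random Dataset"` uses a hyphenated slug in place of the folder's underscore.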
Then, we can upload the dataset:
kaggle datasets create -p random_dataset
You can also add the -u flag to make the dataset public at creation time. Running the command triggers the upload.
(base) ➜ kaggle_cli kaggle datasets create -p randomdataset
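The upload step can be wrapped the same way. The sketch below checks that the metadata file exists before calling the CLI and adds -u only when a public dataset is requested; `upload_dataset` and its second argument are made-up names for illustration.

```shell
# upload_dataset DIR [public|private]: hypothetical wrapper around `kaggle datasets create`.
# Defaults to a private dataset; pass "public" to add the -u flag.
upload_dataset() {
  up_dir="$1"
  up_visibility="${2:-private}"
  if [ ! -f "$up_dir/dataset-metadata.json" ]; then
    echo "missing $up_dir/dataset-metadata.json; create the metadata file first" >&2
    return 1
  fi
  if [ "$up_visibility" = "public" ]; then
    kaggle datasets create -p "$up_dir" -u
  else
    kaggle datasets create -p "$up_dir"
  fi
}
```

Calling `upload_dataset random_dataset` keeps the dataset private, while `upload_dataset random_dataset public` publishes it on creation.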
Downloading datasets works fine from an IP address in China. Uploading, however, requires an IP address outside China.