Using the wget command in Google Colab to retrieve datasets from the source
If, like me, you’ve started your journey into AI and ML using Google Colab, at some point you will need to upload datasets to your Google Drive so they are accessible from Colab. There are several ways to accomplish this task. Let’s examine them…
Taking the Wisconsin Breast Cancer dataset as an example: https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/
Manually downloading the dataset to your local computer then uploading it to your Google Drive — The long-winded way (not recommended)
Opening the breast-cancer-wisconsin folder shows the two data files among others; clicking on them downloads the .data and .names files to your local computer:
Accessing your Downloads folder will show the two files downloaded:
Access your Google Drive and double-click the folder where you want to place the files, e.g. My_Python_Programs. From there, right-click and select ‘New Folder’ from the menu:
Give the folder a descriptive name and click on ‘Create’
This creates an empty dataset folder in your Google Drive:
Now that the dataset folder is created we can manually place the downloaded files into this folder. Click on the ‘New’ icon in the top left of the Google Drive screen:
Now click on ‘File upload’ from the menu:
Select the two downloaded files in your Downloads folder (on a Mac, hold the cmd key and click each unhighlighted file to select multiple files). With both files highlighted, click the ‘Open’ button:
The files are uploaded to your Google Drive:
Job done! Admittedly, we downloaded the dataset to our local computer only to upload it to Google Drive, and it took several steps. Although this method works, it’s not practical when dealing with large datasets. If only there were a better way. Cue wget…
Since Google Colab runs on Linux, we can execute Linux commands in Colab, and one of the commands for retrieving files is wget. wget stands for ‘web get’; it retrieves the dataset directly from the source straight into Google Drive, without it ever passing through your computer. Far quicker and more efficient.
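Before wiring this up to Drive, the mechanics of wget -P can be sketched in a plain Python session. A hedged, self-contained example: the local server and file names below are illustrative stand-ins for the UCI host, so the sketch runs anywhere; in a Colab cell you would simply write `!wget -P <folder> <url>` with the real URL.

```python
import functools
import os
import shutil
import subprocess
import tempfile
import threading
import urllib.request
from http.server import HTTPServer, SimpleHTTPRequestHandler

# Stand-in for the remote host: serve a small file from a local directory.
# (In Colab you would point wget at the real UCI URL instead.)
src = tempfile.mkdtemp()
with open(os.path.join(src, "sample.data"), "w") as f:
    f.write("sample,data\n")

handler = functools.partial(SimpleHTTPRequestHandler, directory=src)
server = HTTPServer(("127.0.0.1", 0), handler)  # port 0 = pick any free port
threading.Thread(target=server.serve_forever, daemon=True).start()

dst = tempfile.mkdtemp()  # plays the role of the Drive dataset folder
url = f"http://127.0.0.1:{server.server_port}/sample.data"

# -P sets the directory wget saves into; the file keeps its remote name.
if shutil.which("wget"):
    subprocess.run(["wget", "-q", "-P", dst, url], check=True)
else:
    # Fallback for environments without wget (Colab has it preinstalled)
    urllib.request.urlretrieve(url, os.path.join(dst, "sample.data"))

server.shutdown()
print(os.listdir(dst))  # ['sample.data']
```

The key point the sketch demonstrates: wget with -P drops the file, under its original name, straight into the directory you name, which is exactly what we will do with a Drive folder below.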
Let’s rewind the steps above to the point that you’ve created the dataset folder:
From here, right-click, hover over ‘More’, then click on ‘Google Colaboratory’:
Mount your Google Drive by clicking on the bottom left folder icon followed by the right-most icon that shows ‘Mount Drive’:
Click ‘CONNECT TO GOOGLE DRIVE’:
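The same mount can also be performed from a code cell instead of the folder icon, using the standard google.colab helper. The helper only exists inside a Colab runtime, hence the guard in this sketch:

```python
try:
    from google.colab import drive  # available only inside a Colab runtime
    drive.mount('/content/drive')   # prompts for authorisation, then mounts Drive
    IN_COLAB = True
except ImportError:
    IN_COLAB = False  # running outside Colab; nothing to mount
```

Either route ends with your Drive visible under /content/drive in the file browser.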
Once Google Drive is mounted, simply drill down through the directories to the dataset folder we created earlier. Right-click the folder and click ‘Copy path’:
Now jump back to Colab and in the code section enter the following:
!wget -P
Press the spacebar and use Ctrl + V (on Mac cmd + V) to paste the contents of the clipboard at the end of this command:
Now press the spacebar again to add a separating space
Head over to your breast-cancer-wisconsin dataset directory, locate the .data file, right click and click ‘Copy Link Address’
Head back to Colab and click Ctrl + V (cmd + V on Mac) to paste the contents of the clipboard into the code cell:
Click the ‘Run cell’ button; the output displays the status of the download and the location of the downloaded file in your Google Drive:
Accessing the Google Drive dataset folder should show the .data file present:
In Colab, repeat the above for the .names file: hover below the code cell until the ‘Code’ and ‘Text’ buttons appear, then click ‘Code’ to create another code cell:
Copy and Paste the !wget -P command and the location of the dataset folder:
Head over to your breast-cancer-wisconsin dataset directory, locate the .names file, right click and click ‘Copy Link Address’
In Colab, at the end of the pasted code, press Ctrl + V (cmd + V on Mac) to paste the contents of the clipboard into the code cell (ensure a whitespace separates the dataset location and the pasted URL). Click the ‘Run cell’ button…
The output displays the status of the download and the location of the downloaded file in your Google Drive:
Accessing the Google Drive dataset folder should now show both the .data file and .names file present:
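With both files in place, the .data file can be read straight into pandas. A minimal sketch: the two rows below are stand-ins written in the file’s comma-separated, headerless layout so the example runs anywhere; in Colab you would pass the Drive path to the real file instead.

```python
import io
import pandas as pd

# In Colab the path would be something like:
# /content/drive/MyDrive/My_Python_Programs/breast_cancer_dataset/breast-cancer-wisconsin.data
# Stand-in rows in the same headerless CSV layout:
sample = io.StringIO(
    "1000025,5,1,1,1,2,1,3,1,1,2\n"
    "1002945,5,4,4,5,7,10,3,2,1,2\n"
)

# header=None because the .data file carries no header row; the column
# meanings are documented in the accompanying .names file
df = pd.read_csv(sample, header=None)
print(df.shape)  # (2, 11)
```

This is exactly where the .names file earns its keep: it tells you what each of the eleven columns means, so you can pass sensible labels via read_csv’s `names` parameter.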
In the above example we ran two commands, one for the .data file and another for the .names file. wget accepts multiple URLs, so we can fetch both the .names and .data files with a single command. We need to run the following:
!wget -P {location of where you’d like the files to go} {first file to retrieve} {second file to retrieve} {nth file to retrieve}
Which translates in our example to:
!wget -P /content/drive/MyDrive/My_Python_Programs/breast_cancer_dataset https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.names
Click the ‘Run cell’ button and both the .data and the .names files are downloaded in one fell swoop into the dataset directory:
This will update the directory structure of the mounted Google Drive to show the datasets present:
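You can also confirm the downloads from within the notebook itself. A small check using the folder path from this walkthrough (adjust it to your own Drive layout); outside Colab, or before mounting, the directory simply won’t exist, so the sketch guards for that:

```python
import os

# Path used in this walkthrough; substitute your own dataset folder
dataset_dir = "/content/drive/MyDrive/My_Python_Programs/breast_cancer_dataset"

# Guard against running before Drive is mounted (or outside Colab entirely)
files = sorted(os.listdir(dataset_dir)) if os.path.isdir(dataset_dir) else []
print(files)
```

In Colab, after the wget cell has run, this should list both breast-cancer-wisconsin.data and breast-cancer-wisconsin.names.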
P.S. If you want a good machine learning tutorial to follow along with, one that puts the Wisconsin breast cancer dataset to use by implementing a K-Nearest-Neighbours algorithm, check out https://pythonprogramming.net/k-nearest-neighbors-application-machine-learning-tutorial/?completed=/k-nearest-neighbors-intro-machine-learning-tutorial/
I hope this helps. Any questions or comments, just leave them in the responses section. Happy Python coding in Colab…