#commons-downloader

A script to bulk download files from Wikimedia Commons.

#Features

  • Entirely POSIX shell and very common utilities—runs almost anywhere
  • Download images and videos
  • Download everything in a Category
  • Download everything in a search result (using superior CirrusSearch)
  • Ability to resume downloads without scraping results again or redownloading existing files
  • Progress output
  • Download into subdirectory

#Acquiring

#Requirements

  • POSIX-compatible sh.
  • cURL, with SSL/TLS support
  • Wget, with SSL/TLS support
  • jq (only for downloading from categories)
  • xmllint (only for downloading from search queries)
  • Utilities as mandated by POSIX:
    • All "Special Built-In Utilities" (should be part of a compliant sh).
    • printf
    • echo
    • true
    • false
    • grep
    • sed
    • tr
    • mkdir
    • sort

Really, if you don't have all of the utilities above installed on your machine then you need to get a better shell or rebuild Busybox or something; they're all pretty basic (other than cURL, Wget, jq, and xmllint, I suppose).
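
If you want a quick way to confirm everything is available before running the script, a loop along these lines (purely illustrative, not part of the script) will report anything missing:

# report any required tool that isn't on $PATH (illustrative only)
for tool in curl wget jq xmllint grep sed tr mkdir sort; do
    command -v "$tool" >/dev/null 2>&1 || echo "missing: $tool"
done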

#Get it

Download the script directly: https://git.sr.ht/~nytpu/commons-downloader/blob/master/commons-downloader

Or clone the repo:

git clone https://git.sr.ht/~nytpu/commons-downloader

Then run the script in place with something like ./commons-downloader, or symlink it somewhere in your $PATH.
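
For example, a typical install sequence (illustrative only; it assumes ~/.local/bin exists and is in your $PATH) might look like:

curl -LO https://git.sr.ht/~nytpu/commons-downloader/blob/master/commons-downloader
chmod +x commons-downloader
ln -s "$PWD/commons-downloader" ~/.local/bin/  # assumes ~/.local/bin is in $PATH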

#Usage

Usage: commons-downloader [-chns] [-o outdir] [-q query]... [-r file] [-u agent] <category>

    -c         Download all images in a given category.
    -h         Display this help information.
    -n         No output or progress information.
    -o outdir  Download all images to the given directory (will be created).
    -q query   Additional queries to add when downloading from a search.
    -r file    Resume downloading URIs from a given file.
    -s         Download all images from a search for the given category and queries.
    -u agent   Change the user agent to use for requests.
    category   The formal category name you wish to download from.

The main options are -c, -s, and -r.

-c will download all matches in a category, and -s will download all matches for a search. They can be combined; the downloaded files are deduplicated, so any overlap between the two result sets is not an issue.

-r <URL list file> will resume a download given a list of URLs, and is mutually exclusive with -c and -s. The URLs for a given download are automatically saved to _URLS.txt in the directory holding the downloaded files.

At least one of -c, -s, and -r is required to be passed.

Multiple -q <add'l query> flags can be passed when using -s to add additional terms to the search. -q has no effect if -s is not also passed.

For example, commons-downloader -s -q Q173651 -q "African Wild Dog" Lycaon pictus
is equivalent to the search "Lycaon pictus" OR "Q173651" OR "African Wild Dog"

-o <out directory> will download all files to the given directory, creating it if necessary. The current directory is the default if -o is not passed.

The mandatory argument is a category. If only -s is passed it can be an arbitrary search query, but if -c is passed then it must be an official Wikimedia Commons category. A category can be verified by visiting https://commons.wikimedia.org/wiki/Category:<category_name>. You can often find a relevant category by going to the bottom of a Wikipedia article and looking for a box that says:

Wikimedia Commons has media related to: <article name> (category)

You can then click the (category) link to find the Wikimedia Commons category.
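
If you prefer to check from a terminal, something like this works (illustrative; Panthera_uncia is just the category used in the examples below), since MediaWiki typically serves existing pages with HTTP 200 and missing ones with 404:

curl -s -o /dev/null -w '%{http_code}\n' "https://commons.wikimedia.org/wiki/Category:Panthera_uncia"

A 200 means the category exists; a 404 means the name is wrong or the category does not exist.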

#Examples

Download all files in the Panthera uncia category and all results for the search "Panthera uncia" OR "Q30197" OR "snow leopard" OR "Uncia uncia" to the snep/ subdirectory of the current directory:

commons-downloader -cs -o snep -q Q30197 -q "snow leopard" -q "Uncia uncia" Panthera uncia

If the download in the previous command was interrupted, it could be resumed with:

commons-downloader -o snep -r snep/_URLS.txt

#Contributing

The upstream URL of this project is https://sr.ht/~nytpu/commons-downloader. Send suggestions, bugs, patches, and other contributions to ~nytpu/public-inbox@lists.sr.ht or alex@nytpu.com. For help sending a patch through email, see https://git-send-email.io. You can browse the list archives at https://lists.sr.ht/~nytpu/public-inbox.

Written in 2021–2022 by nytpu <alex [at] nytpu.com>

To the extent possible under law, the author(s) have dedicated all copyright and related and neighboring rights to this software to the public domain worldwide. This software is distributed without any warranty.

You can view a copy of the CC0 Public Domain Dedication in COPYING or at http://creativecommons.org/publicdomain/zero/1.0/.