
#commons-downloader

A script to bulk download files from Wikimedia Commons.

#Features

  • Entirely POSIX shell and very common utilities—runs almost anywhere
  • Download images and videos
  • Download everything in a Category
  • Download everything in a search result (using superior CirrusSearch)
  • Ability to resume downloads without re-scraping results or re-downloading existing files
  • Progress output
  • Download into a subdirectory

#Acquiring

#Requirements

  • POSIX-compatible sh
  • cURL, with SSL/TLS support
  • Wget, with SSL/TLS support
  • jq (only for downloading from categories)
  • xmllint (only for downloading from search queries)
  • Utilities as mandated by POSIX:
    • All "Special Built-In Utilities" (should be part of a compliant sh)
    • printf
    • echo
    • true
    • false
    • grep
    • sed
    • tr
    • mkdir
    • sort
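
Not part of the script itself, but a quick way to sanity-check the list above is to probe each tool with the POSIX `command -v` built-in. This is a sketch; adjust the tool names if your system packages them differently:

```shell
#!/bin/sh
# Report any required tool that is not on PATH.
# (curl, wget, jq, and xmllint are only needed for the features
# listed above; the rest are mandated by POSIX.)
for tool in curl wget jq xmllint grep sed tr mkdir sort; do
    command -v "$tool" >/dev/null 2>&1 || printf 'missing: %s\n' "$tool" >&2
done
```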

#Get it

Download the script directly: https://git.sr.ht/~nytpu/commons-downloader/blob/master/commons-downloader

Or clone the repo:

git clone https://git.sr.ht/~nytpu/commons-downloader

Then run the script in place with something like ./commons-downloader, or symlink it into a directory on your $PATH.
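
For example, after cloning, the symlink route might look like this (~/.local/bin is just an assumption; any directory already on your PATH works):

```shell
#!/bin/sh
script=commons-downloader/commons-downloader   # path inside the cloned repo
bindir="$HOME/.local/bin"                      # assumption: a dir on your PATH
mkdir -p "$bindir"
if [ -f "$script" ]; then
    chmod +x "$script"
    ln -sf "$PWD/$script" "$bindir/commons-downloader"
fi
```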

#Usage

See commons-downloader -h at any time for an overview.

The main options are -c, -s, and -r.

-c will download all matches in a category, and -s will download all matches for a search. They can be combined; downloaded files are deduplicated, so an intersection between the two result sets is not an issue.

-r <URL list file> will resume a download given a list of URLs, and is mutually exclusive with -c and -s.

At least one of -c, -s, and -r must be passed.

Multiple -q <add'l query> flags can be passed alongside -s to add additional queries to a search; -q has no effect unless -s is also passed.

For example, commons-downloader -s -q Q173651 -q "African Wild Dog" Lycaon pictus is equivalent to the search "Lycaon pictus" OR "Q173651" OR "African Wild Dog".

-o <out directory> will download all files to the given directory, creating it if necessary. The current directory is the default if -o is not passed.

The mandatory positional argument is the category or search query. If only -s is passed it can be an arbitrary search query, but if -c is passed then it must be an official Wikimedia Commons category. A category can be verified by visiting https://commons.wikimedia.org/wiki/Category:<category_name>. You can often find the relevant category by going to the bottom of a Wikipedia page and looking for a box that says:

Wikimedia Commons has media related to: <article name> (category)

You can then click the (category) link to find the Wikimedia Commons category.
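
You can also check a category from the command line via the standard MediaWiki action API (this is not part of the script, just a sketch using the same curl and jq the script requires; the category name is an example):

```shell
#!/bin/sh
# Query the Commons API for a category page; the API reports a
# page keyed "-1" (with a "missing" flag) when the title does not exist.
category="Panthera uncia"
curl -sG 'https://commons.wikimedia.org/w/api.php' \
    --data-urlencode 'action=query' \
    --data-urlencode 'format=json' \
    --data-urlencode "titles=Category:$category" \
  | jq -e '.query.pages | has("-1") | not' >/dev/null \
  && echo "category exists" \
  || echo "no such category"
```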

#Examples

Download all files in the Panthera uncia category and all results for the search "Panthera uncia" OR "Q30197" OR "snow leopard" OR "Uncia uncia" to the snep/ subdirectory in the current folder:

commons-downloader -cs -o snep -q Q30197 -q "snow leopard" -q "Uncia uncia" Panthera uncia

If the download in the previous command was interrupted, it could be resumed with:

commons-downloader -o snep -r snep/_URLS.txt
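
Conceptually, a resume pass reads the saved URL list and fetches only what is not already on disk. A rough sketch of that idea (the script's real logic may differ; _URLS.txt is the list written to the output directory during the original run):

```shell
#!/bin/sh
# Fetch each URL from the saved list, skipping files already present.
urls=snep/_URLS.txt
if [ -f "$urls" ]; then
    while IFS= read -r url; do
        file=${url##*/}                    # local name = last URL component
        [ -e "snep/$file" ] && continue    # already downloaded; skip it
        (cd snep && curl -sfO -- "$url")   # fetch into the output directory
    done < "$urls"
fi
```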

#Contributing

The upstream URL of this project is https://git.sr.ht/~nytpu/commons-downloader. Send suggestions, bugs, patches, and other contributions to ~nytpu/public-inbox@lists.sr.ht. For help sending a patch through email, see https://git-send-email.io. You can browse the list archives at https://lists.sr.ht/~nytpu/public-inbox.

Written in 2021 by nytpu <alex [at] nytpu.com>

To the extent possible under law, the author(s) have dedicated all copyright and related and neighboring rights to this software to the public domain worldwide. This software is distributed without any warranty.

You can view a copy of the CC0 Public Domain Dedication in COPYING or at http://creativecommons.org/publicdomain/zero/1.0/.