This is a configuration-based, file-caching downloader with initial support for HTTP requests and queries against Elasticsearch.
KGHub Downloader is available to install via pip:
pip install kghub-downloader
The downloader requires a YAML file which contains a list of target URLs to download, and local names to save those downloads.
For an example, see example/download.yaml
Available options are:
- http(s)
- ftp, with the glob: option to download files with specific extensions (currently FTP only; searches recursively)
- Google Cloud Storage (gs://)
- Google Drive (gdrive:// or https://drive.google.com/…). The file must be publicly accessible.
- Amazon AWS S3 bucket (s3://)
- git://RepositoryOwner/RepositoryName
- elasticsearch
If the URL includes a name in {CURLY_BRACES}, it will be expanded from environment variables.
Note: Google Cloud Storage URLs require that you have set up your Google Cloud credentials. You must:
- create a service account,
- add the service account to the relevant bucket, and
- download a JSON key for that service account.
Then, set the GOOGLE_APPLICATION_CREDENTIALS environment variable to point to that file.
Mirroring local files to an Amazon AWS S3 bucket requires the following:
- Create an AWS account.
- Create an IAM user in AWS: This provides the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY needed for authentication. Both should be stored as environment variables on the user's system.
- Create an S3 bucket: This will be the destination for pushing local files.
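Since both notes come down to having the right environment variables in place, a quick pre-flight check can save a failed run. The following is a minimal sketch, not part of KGHub Downloader: the `REQUIRED_ENV` table and the `missing_credentials` helper are hypothetical names, and only the variable names themselves come from the notes above.

```python
import os

# Environment variables named in the notes above, keyed by URL scheme.
# This mapping is an illustration, not something the downloader exposes.
REQUIRED_ENV = {
    "gs": ["GOOGLE_APPLICATION_CREDENTIALS"],
    "s3": ["AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY"],
}


def missing_credentials(url, env=None):
    """Return the required variables that are unset or empty for this URL."""
    env = os.environ if env is None else env
    # Take the scheme ("gs", "s3", "https", ...) from the front of the URL.
    scheme = url.split("://", 1)[0] if "://" in url else ""
    return [name for name in REQUIRED_ENV.get(scheme, []) if not env.get(name)]


if __name__ == "__main__":
    for target in ("gs://your-bucket/desired/directory", "s3://your-bucket"):
        missing = missing_credentials(target)
        if missing:
            print(f"{target}: please set {', '.join(missing)}")
```

Passing a plain dictionary as `env` makes the check easy to try out without touching the real environment.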
You can also include any secrets, such as API keys, that you have set as environment variables using {VARIABLE_NAME}, for example:
---
- url: "https://example.com/myfancyfile.json?key={YOUR_SECRET}"
  localname: myfancyfile.json
Note: YOUR_SECRET MUST be set as an environment variable, and be sure to include the {curly braces} in the URL string.
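To make the substitution concrete, here is a minimal sketch of how a {NAME} placeholder could be filled in from the environment. It is an illustration only and not necessarily how KGHub Downloader implements it; `expand_placeholders` is a hypothetical helper, and raising an error for an unset variable is an assumption chosen so that a missing secret fails loudly.

```python
import os
import re

# Matches {NAME}, where NAME looks like an environment variable name.
_PLACEHOLDER = re.compile(r"\{([A-Za-z_][A-Za-z0-9_]*)\}")


def expand_placeholders(url, env=None):
    """Replace each {NAME} in url with the value of NAME taken from env."""
    env = os.environ if env is None else env

    def substitute(match):
        name = match.group(1)
        if name not in env:
            # Assumption: fail early rather than request a broken URL.
            raise KeyError(f"environment variable {name} is not set")
        return env[name]

    return _PLACEHOLDER.sub(substitute, url)


if __name__ == "__main__":
    fake_env = {"YOUR_SECRET": "abc123"}
    url = "https://example.com/myfancyfile.json?key={YOUR_SECRET}"
    print(expand_placeholders(url, fake_env))
    # prints: https://example.com/myfancyfile.json?key=abc123
```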
Downloader can be used directly in Python or via the command line.
from kghub_downloader.download_utils import download_from_yaml
download_from_yaml(yaml_file="download.yaml", output_dir="data")
To download files listed in a download.yaml file:
$ downloader [OPTIONS] [YAML_FILE]
Arguments:
- [YAML_FILE]: List of files to download in YAML format [default: download.yaml]
Options:
- --output-dir TEXT: Path to output directory [default: .]
- --ignore-cache / --no-ignore-cache: Ignore already downloaded files and download again [default: no-ignore-cache]
- --progress / --no-progress: Show progress for individual downloads [default: progress]
- --fail-on-error / --no-fail-on-error: Do not attempt to download more files if one raises an error [default: no-fail-on-error]
- --snippet-only / --no-snippet-only: Only download a snippet of the file (HTTP(S) resources only) [default: no-snippet-only]
- --verbose / --no-verbose: Show verbose output [default: no-verbose]
- --tags TEXT: Optional list of tags to limit downloading to
- --mirror TEXT: Optional remote storage URL to mirror download to. Supported buckets: Google Cloud Storage
Examples:
$ downloader --output-dir example_output --tags zfin_gene_to_phenotype example.yaml
$ downloader --output-dir example_output --mirror gs://your-bucket/desired/directory
# Note that if your YAML file is named `download.yaml`,
# the argument can be omitted from the CLI call.
$ downloader --output-dir example_output
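The caching behaviour behind --ignore-cache / --no-ignore-cache can be pictured with a short sketch. This is an assumption about the general idea (reuse a file that already exists locally unless told otherwise), not the tool's actual code; `fetch_with_cache` and its `fetch` callable are hypothetical names.

```python
from pathlib import Path


def fetch_with_cache(local_path, fetch, ignore_cache=False):
    """Write fetch() to local_path unless a cached copy can be reused.

    fetch is any zero-argument callable returning bytes.
    Returns True if fetch() was called, False if the cached file was kept.
    """
    path = Path(local_path)
    if path.exists() and not ignore_cache:
        # Cached copy found and caching is honoured: skip the download.
        return False
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(fetch())
    return True


if __name__ == "__main__":
    import tempfile

    target = Path(tempfile.mkdtemp()) / "data" / "myfancyfile.json"
    fake_download = lambda: b'{"hello": "world"}'
    print(fetch_with_cache(target, fake_download))                     # first call downloads
    print(fetch_with_cache(target, fake_download))                     # second call reuses the file
    print(fetch_with_cache(target, fake_download, ignore_cache=True))  # forced re-download
```

Keeping the download step behind a callable keeps the cache decision separate from the transport (HTTP, FTP, bucket storage, and so on).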
git clone https://github.com/monarch-initiative/kghub-downloader.git
cd kghub-downloader
poetry install
poetry run pytest
NOTE: The tests require gcloud credentials to be set up as described above, using the Monarch GitHub Actions service account.