Downloading OME Example Images

by Blair Rossetti
Apr 23, 2020
data-and-analysis
scripts, python

Faster in the Long Run… #

For a recent blog post, I needed to get my hands on some Imaris IMS files. Fortunately for me, the good folks over at OME have a nice repository of example images in a variety of formats. The OME team uses this data for testing the read/write functions in their Bio-Formats library. I was particularly interested in downloading all of their Imaris images, but it appeared that the only way to get the entire dataset was to manually download each file. I can't waste that much time downloading files! Instead, I decided to waste more time writing a script to download the files for me. Programmer logic for the win.

Download All the Things #

I wrote the script in Python because there is already a plethora of web-scraping libraries available. I am using requests to download the content of each webpage and BeautifulSoup to parse the HTML. Since the download site is organized as a directory hierarchy, a recursive solution seemed to be the most elegant way forward. In essence, the script retrieves all of the links from a given webpage. Any URL whose response is not HTML is assumed to be a file (e.g., a text or image file) and is downloaded. Any URL that returns HTML is assumed to be a directory and is recursively searched for more links. If you find yourself in need of OME's datasets, I hope this script proves useful. Enjoy!
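The whole heuristic hinges on the Content-Type response header: a directory listing comes back as HTML, while the files themselves do not. Here is a quick sketch of that check in isolation (the printed header value is just an illustrative example):

import requests

r = requests.get("https://downloads.openmicroscopy.org/images/")
# a directory listing reports an HTML content type,
# e.g., "text/html; charset=UTF-8"
print(r.headers.get("Content-Type"))

The full script builds on this check: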

import os
import requests
from bs4 import BeautifulSoup, SoupStrainer

def get_links(html):
    """Returns a list of links parsed from an OME downloads webpage"""
    
    soup = BeautifulSoup(html, "html.parser", parse_only=SoupStrainer("td"))
    # use the href attribute rather than the link text so that
    # filenames needing URL encoding are handled correctly
    links = [tag.get("href") for tag in soup.find_all("a")]

    # return all but the first link (which is the parent directory)
    return links[1:]

def get_files(url, output_path):
    """Recursively downloads files from a OME downloads webpage to the output path"""

    print(f"{url}... ", end='')
    r = requests.get(url, stream=True)

    if r.status_code != requests.codes.ok:
        print(f"failed with status code {r.status_code}")
        return
    
    if "html" in r.headers['Content-Type']:
        # if the content is html, then it must be a directory
        if not os.path.exists(output_path):
            os.makedirs(output_path)
            print("made directory")
        else:
            print("directory exists")

        # parse html for links
        links = get_links(r.content)

        # recursively call function to download all links
        for link in links:
            get_files(r.url + link, os.path.join(output_path, link))
    else:
        # if the content is not html, then it must be a file
        if not os.path.exists(output_path):
            with open(output_path, 'wb') as f:
                f.write(r.content)
                print("downloaded")
        else:
            print("skipping")

    return

if __name__ == "__main__":
    base_url = "https://downloads.openmicroscopy.org/images/Imaris-IMS"
    output_path = "~/Desktop/sample-images"

    get_files(base_url, os.path.expanduser(output_path))
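
One caveat: although the script passes stream=True to requests.get, reading r.content still buffers each file entirely in memory before it is written to disk. For the larger IMS files, you may prefer to stream each download to disk in chunks. Below is a minimal sketch of that variant; the download_file helper and the 1 MiB chunk size are my own choices, not part of the original script.

import requests

CHUNK_SIZE = 1024 * 1024  # 1 MiB; an arbitrary but reasonable chunk size

def download_file(url, output_path):
    """Streams a file to disk without holding it all in memory."""
    with requests.get(url, stream=True) as r:
        r.raise_for_status()
        with open(output_path, 'wb') as f:
            # iter_content yields the response body in chunks
            for chunk in r.iter_content(chunk_size=CHUNK_SIZE):
                f.write(chunk)

And if you need a format other than Imaris, simply point base_url at any other directory under https://downloads.openmicroscopy.org/images/.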

Last modified Jul 15, 2020