
Download web images to Google Drive automatically using Apps Script

February 1, 2023

Your team keeps a large collection of high-quality images on some web server, but the person who managed that server was caught in the latest round of layoffs, and now nobody knows how to access the images. Luckily, there's no need to fret: Apps Script can automatically download these images to Google Drive. But before we get into the code, I must make three disclaimers:

Disclaimer #1: Abide by copyright. If you use the script below to download images from websites that you do not own, please first ascertain that the website owners are OK with you doing so. Look for copyright or download information on the site.

Disclaimer #2: Respect the robots.txt file. If you intend to download images from a website that you don't own, first inspect the robots.txt file, which usually lives at the root of the website. The file specifies which sections of the site bots (such as the upcoming script) are allowed or disallowed to access. Ensure that the website owners are OK with you crawling their site with a script.
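
If you want a quick look at that file without leaving Apps Script, here is a minimal sketch that simply prints it so you can read the Allow/Disallow rules yourself (printRobotsTxt is a throwaway name, and the site root is a placeholder you'd swap for the site you're targeting):

function printRobotsTxt() {
  // Replace with the root of the site you plan to crawl.
  const siteRoot = 'https://mywebsite.com';
  const resp = UrlFetchApp.fetch(siteRoot + '/robots.txt', { muteHttpExceptions: true });
  // A 404 simply means the site doesn't publish crawling rules.
  console.log(resp.getResponseCode() === 200 ? resp.getContentText() : 'No robots.txt found');
}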

Disclaimer #3: The script below uses regular expressions to find image URLs in the HTML code. Plain regular expressions are notoriously unreliable at parsing the many variations of HTML, so a proper HTML parser will do a much better job.

Now that we've got that out of the way, let's jump into the code. You'll need to create a Google Apps Script file and a Google Drive folder that will store the images. The script will cover the following actions:

  1. Fetch the web page HTML that contains the images
  2. Get all image URLs
  3. Standardize image URLs
  4. Download image resources
  5. Inspect file size
  6. Save files to Google Drive folder

Let's create a global object with key information that we will refer to throughout the script:

const g = {
  PAGE_URL: 'https://mywebsite.com/path/my-page.html', // page that contains the images
  FOLDER_ID: 'my_folder_id', // Google Drive folder that will store the images
  MIN_KB: 30, // skip files smaller than this (in kilobytes)
};

Above, we store a reference to the web page URL from which we want to download the image files, along with the ID of the Google Drive folder that will store them. Lastly, we set the minimum file size to download (in kilobytes), since we don't want to download 1x1 tracking pixels or tiny icon files.
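
If you want to confirm the folder ID is right before writing any more code, a small throwaway check like the sketch below will do (checkSetup is just a name I made up; it throws if the ID is wrong or you lack access to the folder):

function checkSetup() {
  // Throws if the folder ID is wrong or you don't have access to it.
  const folder = DriveApp.getFolderById(g.FOLDER_ID);
  console.log(`Images will be saved to "${folder.getName()}"`);
}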

Next, we'll define the main function that runs the script. It first fetches the web page HTML and then uses regular expressions to extract the image URLs:

function downloadImages() {
  const html = UrlFetchApp.fetch(g.PAGE_URL).getContentText();
  // Capture the src value of each <img> tag that ends with a known image extension.
  const re = /<img[^>]+src="([^"]+(?:png|gif|jpg|jpeg|svg))"[^>]*>/g;
  let match;
  let sources = [];
  while ((match = re.exec(html)) !== null) {
    sources.push(match[1]);
  }
}

Above, we use UrlFetchApp to fetch the web page, and getContentText gives us its HTML as a string that we can parse. Next, we define a regular expression: we look for image tags inside the HTML and extract the src value of each one.

Depending on the web pages you access, you may need to modify the regex. By all means do so: this script is for educational purposes only.
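
For example, a slightly more forgiving variant (a sketch, not a drop-in guarantee) that also accepts single-quoted src attributes and WebP files might look like this:

// Accepts single- or double-quoted src values and a few more extensions.
const re = /<img[^>]+src=["']([^"']+\.(?:png|gif|jpe?g|svg|webp))["']/gi;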

Next, we need to standardize the image URLs, since they can have either absolute or relative paths. That is, a URL can take any of the following forms:

  • https://www.mysite.com/img/image1.png
  • image.png
  • /image.png
  • ./image.png
  • ../image.png
  • ../../image.png

Web browsers do a great job of handling relative URLs, but we need to make them absolute so that we can fetch their resources. Let's create a function to do so:

function formatUrl_(url) {
  if (url.startsWith('http')) {
    return url;
  }
  let pageUrl = g.PAGE_URL;
  if (pageUrl.endsWith('/')) {
    pageUrl = pageUrl.slice(0, -1);
  }
  pageUrl = pageUrl.split('/');
  // A root-relative path ("/image.png") resolves against the site root,
  // so keep only the protocol and domain parts of the page URL.
  if (url.startsWith('/')) {
    return pageUrl.slice(0, 3).join('/') + url;
  }
  // Normalize bare relative paths ("image.png") to begin with "./".
  if (/^[a-zA-Z]/.test(url)) {
    url = './' + url;
  }
  url = url.split('/');
  // Drop the page's file name so pageUrl holds only the base directory.
  if (pageUrl[pageUrl.length - 1].includes('.')) {
    pageUrl.pop();
  }
  // "./" stays in the current directory; each "../" climbs one level up.
  while (url[0] == '..' || url[0] == '.') {
    if (url[0] == '..') {
      pageUrl.pop();
    }
    url.shift();
  }
  url = pageUrl.concat(url).join('/');

  return url;
}

In the formatUrl_ function above, we first check whether the URL is already absolute: if it is, we return it as-is. Next, we copy the page URL into its own variable so that we can turn it into the base URL, strip off a trailing slash if there is one, and split it into an array on slashes. If the image URL is root-relative (it starts with "/"), we resolve it against the site root, which is made up of the first parts of the page URL (the protocol and domain). Otherwise, we standardize the image URL so that it begins with "./".

Next, we split the image URL into an array as well. If the base URL ends with a file name (which contains a period), we remove that part. We then loop over the leading elements of the image URL: "./" simply refers to the current directory, so we drop it, while each "../" also removes the last sub-directory from the base URL. Finally, we concat the two arrays and join them with slashes before returning the URL.

We can create a test function to ensure that the code works as expected for any type of file URL:

function dev() {
  let url = 'https://mysite.com/img/image1.png';
  console.log(formatUrl_(url));
  url = 'image1.png';
  console.log(formatUrl_(url));
  url = '/img/image1.png';
  console.log(formatUrl_(url));
  url = './image1.png';
  console.log(formatUrl_(url));
  url = '../image1.png';
  console.log(formatUrl_(url));
  url = '../../image1.png';
  console.log(formatUrl_(url));
}

When we run it, we get an absolute URL for any variation.

Back in our downloadImages function, we can now standardize our URLs using the new function:

let imageUrls = sources.map((url) => formatUrl_(url));
imageUrls = [...new Set(imageUrls)];

The page may include a given image more than once. Since we don't want to download duplicate images, we use a Set to de-dup the list of image URLs.

Next, I want to use the fetchAll method of UrlFetchApp. You could execute the fetches individually, but I prefer to do them all in one go. To do that, I need to convert the URLs to request objects and wrap the call in a try/catch:

let requests = imageUrls.map((url) => ({
  url,
  muteHttpExceptions: true,
}));
let resp;
try {
  resp = UrlFetchApp.fetchAll(requests);
} catch (err) {
  // Stop here if the batch fetch itself fails.
  console.log(err);
  return;
}
console.log(`Found ${resp.length} total images`);

Above, I use map to convert each URL string to a request object containing the URL, with muteHttpExceptions set so that I can inspect any error myself. If fetchAll itself throws, I log the error and stop. Lastly, I log the number of images that were fetched.
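
Note that with muteHttpExceptions set, failed fetches (for example, a 404 for a broken image link) still come back as responses. If you want to drop them before going further, one option is to filter on the response code, roughly like this:

// Keep only the responses that actually succeeded.
resp = resp.filter((r) => r.getResponseCode() === 200);
console.log(`${resp.length} images fetched successfully`);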

Next, I want to remove any images that are smaller in file size than the threshold I had specified in my global object:

  for (let i = resp.length - 1; i >= 0; i--) {
    const fileSize = resp[i].getAllHeaders()['Content-Length'];
    if (fileSize / 1000 < g.MIN_KB) {
      resp.splice(i, 1);
    }
  }
  console.log(`Will save ${resp.length} image files`);

I iterate over the responses backwards. For each one, I get the headers and look at Content-Length, which gives the file size in bytes. I divide that number by 1000 and compare it to g.MIN_KB: if the file is smaller, I splice it out of the array. Finally, I log the number of files that will actually be stored.
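
One caveat: not every server returns a Content-Length header. If fileSize comes back undefined, a fallback (sketched below; it reads the response body, so it's a bit heavier) is to measure the blob itself:

  for (let i = resp.length - 1; i >= 0; i--) {
    // Fall back to the blob's byte count when Content-Length is missing.
    const fileSize = resp[i].getAllHeaders()['Content-Length'] || resp[i].getBlob().getBytes().length;
    if (fileSize / 1000 < g.MIN_KB) {
      resp.splice(i, 1);
    }
  }

With the small files filtered out, let's store the ones that remain: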

// Get the destination folder from our global object.
const folder = DriveApp.getFolderById(g.FOLDER_ID);
resp.forEach((r, i) => {
  const blob = r.getBlob();
  const file = folder.createFile(blob);
  console.log(`Saved file ${i + 1}: ${file.getName()}`);
});

Above, we first grab the destination folder by its ID and then iterate over the responses. For each one, we get its blob and create a file from it in the folder, logging the file's number and name as we go. That's all there is to it.

Interested in customizing this script? Contact me