Sourcing image content for your app can be a time-intensive, maddening process. At PLANT GROUP, we’re building a PlantFinder tool where users can learn about different plants and their ethnobotanical uses. After manually sourcing the first 100 or so images for the app, and on the verge of letting out a blood-curdling existential scream, I decided to make this programming thing work for me. The solution, which we’ll walk through in this post, is to use Python’s Selenium library to control a browser through a web driver and automate image search and download.
You can also view the full source code on GitHub.
Step 1: Download the Chrome web driver (ChromeDriver). Remember the path where you save the driver, as you’ll need that path in your script.
Step 2: Put your Python script in a folder along with a “downloads” folder and your CSV with the keywords you want to input in Google. In our case, we will be entering the plant’s scientific name (from a column in our CSV) into Google’s search engine.
Step 3: Run your script and watch it automagically open your browser and start downloading images!
Ok, let’s dive into the code!
Step 1: Import libraries
from selenium import webdriver
import time
import urllib.request
import os
from selenium.webdriver.common.keys import Keys
import csv
Step 2: Image download function. The first lines of this code open Google in the browser, access the search bar, pass in the keywords, and click through to the “Images” page. The script then collects every element with the “img” tag in the HTML.
def imgDownload(key_words):
    # Note: written for Selenium 3. Selenium 4.3+ removed the find_element_by_*
    # helpers; the equivalent there is browser.find_element(By.NAME, 'q').
    browser = webdriver.Chrome("/Users/<YourUser>/chromedriver")
    browser.get("https://www.google.com/")
    search = browser.find_element_by_name('q')
    # key_words = "Anisocampium niponicum"
    search.send_keys(key_words, Keys.ENTER)
    # click on Images page
    elem = browser.find_element_by_link_text('Images')
    elem.click()  # comment out for testing
    # collect every <img> element on the results page
    sub = browser.find_elements_by_tag_name("img")
    # create folder for downloads
    try:
        os.mkdir('downloads')
    except FileExistsError:
        pass
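Step 3: Choose a subset of the results and make an HTTP request to retrieve each image. In this example I’ve arbitrarily chosen a slice of the first few results and a hard-coded file name, image5.jpg. Every image in the slice is saved to that same file, so the one left on disk is the last successful download.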
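    for i in sub[1:10]:
        src = i.get_attribute('src')
        try:
            if src is not None:
                src = str(src)
                print(src)
                # each download overwrites downloads/image5.jpg;
                # the file is renamed to the species name in Step 4
                urllib.request.urlretrieve(src, os.path.join('downloads', 'image' + str(5) + '.jpg'))
            else:
                # <img> elements without a src attribute cannot be downloaded
                raise TypeError
        except TypeError:
            print('Fail')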
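If the goal is to keep one specific result, such as the fifth, rather than whichever download in the slice finishes last, the selection can be separated from the download. The sketch below is a hypothetical helper (`nth_usable_src` is not part of the original script). It works on plain `src` strings, such as those returned by `get_attribute('src')`, and assumes that only `http(s)` and `data:` URLs are worth fetching.

```python
from typing import Iterable, Optional


def nth_usable_src(srcs: Iterable[Optional[str]], n: int) -> Optional[str]:
    """Return the n-th (1-based) src that looks downloadable, or None.

    Hypothetical helper: "usable" is an assumption here, meaning a
    non-empty string that starts with http://, https:// or data:.
    """
    if n < 1:
        raise ValueError("n must be 1 or greater")
    seen = 0
    for src in srcs:
        if not src:
            continue  # <img> tags without a src attribute yield None
        if not src.startswith(("http://", "https://", "data:")):
            continue  # skip schemes we do not expect to download
        seen += 1
        if seen == n:
            return src
    return None  # fewer than n usable sources were found
```

Inside the download function this could be called as `nth_usable_src((img.get_attribute('src') for img in sub), 5)`, followed by a single `urlretrieve` call if the result is not `None`.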
Step 4: Rename the image to the species name (this will help you keep track of your downloads).
    # the image was saved relative to the script's folder, so rename it there
    src = os.path.join('downloads', 'image5.jpg')
    dst = os.path.join('downloads', key_words + '.jpg')
    os.rename(src, dst)
    # close browser
    browser.close()
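Species names can contain spaces or other characters that are awkward in file names, and building the path with `os.path.join` avoids hard-coding a user directory. Here is a minimal sketch of one way to do that. The helper name `species_image_path` and the choice to replace anything other than ASCII letters, digits, dots, and hyphens with underscores are assumptions for illustration, not part of the original script.

```python
import os
import re


def species_image_path(key_words: str, folder: str = "downloads") -> str:
    """Build a relative path such as downloads/Alyssum_montanum.jpg.

    Hypothetical helper: the character whitelist below is an assumption;
    loosen it if your names rely on non-ASCII characters.
    """
    # collapse every run of disallowed characters into a single underscore
    safe_name = re.sub(r"[^A-Za-z0-9.-]+", "_", key_words.strip()).strip("_")
    if not safe_name:
        raise ValueError("key_words produced an empty file name")
    return os.path.join(folder, safe_name + ".jpg")
```

The result can be passed directly as the destination of `os.rename`, or used as the download target so that no rename is needed at all.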
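Step 5: Copy your keywords into a Python list.
speciesName = []
with open('PlantFinder.csv', mode='r') as f:
    reader = csv.reader(f, delimiter=',')
    for n, row in enumerate(reader):
        if not n:
            continue  # skip the header row
        # assumes the scientific name is in the first column;
        # adjust the index to match your CSV
        speciesName.append(row[0])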
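Reading the column by its header name instead of its position can make the script less sensitive to column order. The sketch below uses `csv.DictReader` for that. The function name `read_species_names` and the column name `scientific_name` in the usage note are hypothetical, since the post does not show the CSV's headers.

```python
import csv
from typing import Iterable, List


def read_species_names(lines: Iterable[str], column: str) -> List[str]:
    """Collect the non-empty values of one named column.

    `lines` can be an open file object or any iterable of CSV lines.
    The first line is treated as the header row.
    """
    reader = csv.DictReader(lines)
    if reader.fieldnames is None or column not in reader.fieldnames:
        raise KeyError(f"column {column!r} not found in CSV header")
    names = []
    for row in reader:
        name = (row.get(column) or "").strip()
        if name:  # skip rows where the name is missing
            names.append(name)
    return names
```

With a file this would look like `with open('PlantFinder.csv', newline='') as f: speciesName = read_species_names(f, 'scientific_name')`, substituting whatever header your CSV actually uses.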
Step 6: Call your function and pass in keywords!
print("Done importing species names...")
for name in speciesName:
    print(name)
    imgDownload(name)
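With thousands of names, a single failed search would stop a plain loop partway through. One possible variation, sketched below, catches errors per name, records them, and pauses between searches. `run_batch` and its `delay_seconds` parameter are hypothetical names, and the two-second default is an arbitrary assumption rather than a recommended value.

```python
import time
from typing import Callable, Iterable, List, Tuple


def run_batch(
    names: Iterable[str],
    download: Callable[[str], object],
    delay_seconds: float = 2.0,
) -> List[Tuple[str, str]]:
    """Call download(name) for every name and return the failures.

    Each failure is a (name, error description) pair, so the names
    can be retried or checked by hand afterwards.
    """
    failures = []
    for name in names:
        try:
            download(name)
        except Exception as exc:  # keep going, but remember what went wrong
            failures.append((name, repr(exc)))
        time.sleep(delay_seconds)  # pause before the next search
    return failures
```

In this post's script the call would be `failed = run_batch(speciesName, imgDownload)`, after which `failed` lists the species that still need an image.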
I have somewhat limited space on my computer, so I ended up syncing the folder with image downloads to a Google Drive folder and periodically deleting images from my local device. Running the script saved me hours of tedious work, but there is one major caveat. Some of the images downloaded looked great, others not so much. Turns out, it’s pretty tough to get a script to match the image curation abilities of a human.
Let’s take a look at what the script came up with for Alyssum montanum:
Hmm…not bad. Though the image resolution could be better. A close-up of the flower might also be more useful. But all things considered, pretttty prettttty good.
Here’s an example where the script doesn’t shine as much. Let’s see what it comes up with for the Daphne shrub:
Well… this is awkward. In the long run, I still had to review the thousands of images and weed out the ridiculous ones. An improved script might have an additional layer of logic that checks image quality, filters out low-resolution images, and attempts to download another image. A really fancy version might include computer vision functionality, in order to avoid the very awkward possibility of a Scooby Doo character rendering in the app. Something like Hotdog/Not Hotdog but for plants.
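The "filter out low-resolution images and try another" idea could be sketched roughly as follows. This is only an illustration under assumptions: the `Candidate` type, the function name, and the 400x300 threshold are all hypothetical. The width and height would have to come from somewhere, for instance the `naturalWidth` and `naturalHeight` properties the browser exposes on `img` elements.

```python
from typing import Iterable, NamedTuple, Optional


class Candidate(NamedTuple):
    """One search result: its URL and pixel dimensions (hypothetical type)."""
    src: str
    width: int
    height: int


def first_large_enough(
    candidates: Iterable[Candidate],
    min_width: int = 400,   # arbitrary threshold, tune for your app
    min_height: int = 300,  # arbitrary threshold, tune for your app
) -> Optional[Candidate]:
    """Return the first candidate meeting both minimums, or None."""
    for candidate in candidates:
        if candidate.width >= min_width and candidate.height >= min_height:
            return candidate
    return None  # nothing acceptable; the caller can flag the species for review
```

This only addresses resolution. It would not catch a sharp, high-resolution picture of the wrong Daphne, which is where the computer vision idea would come in.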
Hope this post was helpful. Let me know (firstname.lastname@example.org) if you make any improvements / modifications to the code — I’d love to hear about it!