AI Revolutionizes Botanical Data Access from Herbarium Collections

The Conversation

For centuries, herbaria around the globe have meticulously preserved a unique chronicle of Earth’s plant and fungal life. From a specimen of Epaltes australis collected by Joseph Banks and Daniel Solander in 1770, just after Captain Cook’s Endeavour was repaired following its grounding on the Great Barrier Reef, to the 170,000 specimens housed at the University of Melbourne, these collections collectively contain over 395 million irreplaceable records. This vast botanical archive holds immense potential for understanding biodiversity, evolution, and climate change, yet accessing its full wealth of information has long been a formidable challenge.

The primary obstacle lies in digitizing these physical collections. Institutions worldwide are working to photograph each specimen at high resolution and convert its label information into searchable digital data. Once digitized, these records feed into global platforms like the Australasian Virtual Herbarium and the Global Biodiversity Information Facility, making centuries of botanical knowledge accessible to researchers everywhere. But the sheer scale of the task is daunting. Even a large herbarium such as the National Herbarium of New South Wales, which used a high-capacity conveyor belt system, took more than three years to digitize 1.15 million specimens. For smaller institutions lacking industrial-scale setups, the process is far slower, relying on staff, volunteers, and citizen scientists who painstakingly photograph specimens and manually transcribe labels. At the current pace, many collections will remain undigitized for decades, locking away critical biodiversity data urgently needed by researchers in ecology, evolution, climate science, and conservation.

To overcome this bottleneck, new research has introduced Hespi, an open-source, AI-driven tool designed to speed up access to herbarium data. Short for “herbarium specimen sheet pipeline,” Hespi combines several computer vision and AI techniques: object detection, image classification, and large language models. The process begins with a high-resolution image of a specimen sheet, which typically includes the pressed plant and identifying text. Hespi then employs optical character recognition to read printed text and handwritten text recognition to decipher handwritten notes, a task challenging even for humans. To further improve accuracy, the extracted text is passed to a large language model, such as OpenAI’s GPT-4o, which corrects recognition errors and improves the quality of the digital output.
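The flow described above (route each piece of label text to a printed-text or handwriting reader, then pass the result through a correction step) can be sketched in a few lines of Python. This is only an illustration of the idea, not Hespi's actual code or API: the `Field` class, the `transcribe_label` function, and the toy `fake_*` stand-ins are all hypothetical names invented for this example, and real recognizers would work on image crops rather than strings.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Field:
    """One detected region of a specimen label (hypothetical structure)."""
    name: str          # e.g. "genus" or "collector" (illustrative field names)
    handwritten: bool  # assumed to come from an upstream image classifier
    crop: str          # stand-in for the cropped image of this field


Recognizer = Callable[[str], str]
Corrector = Callable[[Dict[str, str]], Dict[str, str]]


def transcribe_label(fields: List[Field], ocr: Recognizer, htr: Recognizer,
                     correct: Corrector) -> Dict[str, str]:
    """Read each field with OCR or HTR, then apply a correction step."""
    raw: Dict[str, str] = {}
    for field in fields:
        # Printed text goes to OCR, handwriting goes to HTR.
        reader = htr if field.handwritten else ocr
        raw[field.name] = reader(field.crop)
    # In the article's description this step is a large language model;
    # here it is whatever callable the caller supplies.
    return correct(raw)


# Toy stand-ins so the sketch runs without any models installed.
def fake_ocr(crop: str) -> str:
    return crop.strip()


def fake_htr(crop: str) -> str:
    return crop.strip()


def fake_correct(raw: Dict[str, str]) -> Dict[str, str]:
    # Collapse repeated whitespace as a placeholder for real error correction.
    return {key: " ".join(value.split()) for key, value in raw.items()}


fields = [
    Field("genus", handwritten=False, crop=" Epaltes "),
    Field("collector", handwritten=True, crop="J.  Banks"),
]
result = transcribe_label(fields, fake_ocr, fake_htr, fake_correct)
print(result)
```

Passing the recognizers in as plain callables keeps each stage swappable, which is one plausible way to let a different text recognizer or language model be substituted without touching the routing logic.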

In mere seconds, Hespi can locate the main specimen label on a sheet and extract key information, including taxonomic names, collector details, geographical location, latitude and longitude coordinates, and collection dates. This data is then converted into a digital format ready for use in research. For instance, Hespi accurately processed a specimen of a large brown alga collected in St Kilda in 1883, identifying all key details. Testing on thousands of specimen images from the University of Melbourne Herbarium and other collections around the world has shown that Hespi achieves a high degree of accuracy, promising substantial time savings compared to manual data extraction. Future developments include a user-friendly graphical interface to allow curators to review and correct results.
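To make the idea of a "digital format" concrete, the sketch below shows one way the extracted fields could be held in a structured record, with simple range checks that flag values a curator might want to review. The `SpecimenRecord` class, its field names, and the plausibility bounds are assumptions made for illustration; they are not Hespi's output schema. Only the locality and year in the example come from the article, and every other field is left empty rather than guessed.

```python
import json
from dataclasses import asdict, dataclass
from typing import List, Optional


@dataclass
class SpecimenRecord:
    """Hypothetical container for fields read from a specimen label."""
    scientific_name: Optional[str] = None
    collector: Optional[str] = None
    locality: Optional[str] = None
    latitude: Optional[float] = None   # decimal degrees
    longitude: Optional[float] = None  # decimal degrees
    year: Optional[int] = None

    def problems(self) -> List[str]:
        """Return human-readable issues worth a curator's attention."""
        issues: List[str] = []
        if self.latitude is not None and not -90 <= self.latitude <= 90:
            issues.append("latitude out of range")
        if self.longitude is not None and not -180 <= self.longitude <= 180:
            issues.append("longitude out of range")
        # The year bounds are an arbitrary illustrative choice.
        if self.year is not None and not 1500 <= self.year <= 2100:
            issues.append("implausible year")
        return issues


# Locality and year as reported in the article; other fields unknown here.
record = SpecimenRecord(locality="St Kilda", year=1883)
print(json.dumps(asdict(record)))
print(record.problems())
```

Checks like these do not prove a transcription is right, but they give a cheap first filter before a human reviews the results.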

The impact of AI systems like Hespi extends far beyond simple digitization. Herbaria already contribute to society through species identification, taxonomy, ecological monitoring, conservation efforts, education, and even forensic investigations. By mobilizing large volumes of specimen-associated data, AI enables new applications at an unprecedented scale. For example, AI has been used to automatically extract detailed leaf measurements and other traits from digitized specimens, making centuries of historical collections available for rapid research into plant evolution and ecology. This is only the beginning, as computer vision and AI are poised to further accelerate and expand botanical research.

The potential of AI pipelines like Hespi reaches beyond herbaria to any museum or archival collection with high-quality digital images. A new collaboration with Museums Victoria aims to adapt Hespi for museum collections, starting with the digitization of approximately 12,500 specimens from the museum’s globally significant fossil graptolite collection. Additionally, a project with the Australian Research Data Commons (ARDC) is underway to make the software even more flexible, allowing curators in various institutions to customize Hespi for extracting data from diverse collections, not just plant specimens. Just as AI is reshaping many aspects of daily life, these technologies are set to transform access to biodiversity data, facilitating human-AI collaborations to overcome the significant bottleneck of slow, manual transcription. Mobilizing the information locked away in herbaria, museums, and archives worldwide is critical for the cross-disciplinary research needed to understand and address the escalating biodiversity crisis.