Implementing IIIF Image API Support for PDFs with Cantaloupe

IIIF
16 Jul 2021 | 19:22

Summary

TL;DR: In this session, Shane Huddleston and Dave Collins of OCLC discuss adopting Cantaloupe, an image server, for CONTENTdm, OCLC's digital collection management service. They cover the challenges of rendering PDFs and images, compliance with IIIF standards, and the integration process. Dave shares technical detail on Apache PDFBox, which Cantaloupe uses to turn vector PDF pages into raster images, along with the performance issues they encountered and how they prepared for widely varying PDF data. The presentation closes with future plans for IIIF version 3 support and search term highlighting.

Takeaways

  • 📚 OCLC is a nonprofit global library cooperative providing technology services, original research, and community programs for libraries and archives.
  • 🌐 CONTENTdm is OCLC's digital collection management service, with over 1,400 instances worldwide served from data centers in the USA, Canada, Australia, and the Netherlands.
  • 🔍 Before adopting Cantaloupe, OCLC's image server had limitations and did not fully support PDFs, which they aimed to change to broaden their service offerings.
  • 👌 Compliance with IIIF (International Image Interoperability Framework) is essential for OCLC's updated functionality, and they ensured Cantaloupe met these standards.
  • 🤝 Cantaloupe was chosen for its robust JPEG 2000 support, its Java implementation, and its suitability for a multi-tenant environment, which made it a good match for OCLC's needs.
  • 🔧 During implementation, OCLC created a database to support integration with Cantaloupe and transitioned from static to dynamic generation of IIIF presentation manifests.
  • 🛠 Dave Collins described Apache PDFBox, an open-source Java library for working with PDFs, noting its benefits and its limitations in performance and memory management.
  • 📈 OCLC found that nearly 50% of the content served was PDF, emphasizing the necessity to support PDFs alongside image formats in their digital collection management.
  • 🔄 Cantaloupe's original method of addressing PDF pages was not IIIF compliant; OCLC worked with the Cantaloupe team on a solution that meets the standard.
  • 🚀 OCLC is approximately 60% of the way through installing Cantaloupe across all CONTENTdm sites, with data preparation and bug fixes still ongoing.
  • 🔄 They are currently limiting the PDF size for the IIIF image API due to server load considerations, but this is an area they plan to revisit and improve.
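
The move from static to dynamic manifest generation mentioned above can be illustrated with a minimal sketch. It assumes IIIF Presentation API 2.x and a hypothetical stored record holding an identifier, a title, and per-page pixel dimensions; the base URLs, the record layout, and the function name `build_manifest` are illustrative assumptions, not OCLC's actual schema or code. The manifest is assembled at request time from the record instead of being read from a pre-generated file, so a change to the record shows up in the next response.

```python
import json
from urllib.parse import quote

# Hypothetical base URLs, used only for illustration.
IMAGE_BASE = "https://example.org/iiif/2"
MANIFEST_BASE = "https://example.org/manifests"


def build_manifest(record):
    """Build a IIIF Presentation 2.x manifest dict from one stored record."""
    item_id = quote(record["identifier"], safe="")  # encode "/" inside identifiers
    canvases = []
    for page_no, (width, height) in enumerate(record["pages"], start=1):
        canvas_id = f"{MANIFEST_BASE}/{item_id}/canvas/{page_no}"
        # Assumption: each PDF page is addressed as "identifier;page".
        service_id = f"{IMAGE_BASE}/{item_id};{page_no}"
        canvases.append({
            "@id": canvas_id,
            "@type": "sc:Canvas",
            "label": f"Page {page_no}",
            "width": width,
            "height": height,
            "images": [{
                "@type": "oa:Annotation",
                "motivation": "sc:painting",
                "on": canvas_id,
                "resource": {
                    "@id": f"{service_id}/full/full/0/default.jpg",
                    "@type": "dctypes:Image",
                    "format": "image/jpeg",
                    "width": width,
                    "height": height,
                    "service": {
                        "@context": "http://iiif.io/api/image/2/context.json",
                        "@id": service_id,
                        "profile": "http://iiif.io/api/image/2/level2.json",
                    },
                },
            }],
        })
    return {
        "@context": "http://iiif.io/api/presentation/2/context.json",
        "@id": f"{MANIFEST_BASE}/{item_id}/manifest.json",
        "@type": "sc:Manifest",
        "label": record["title"],
        "sequences": [{"@type": "sc:Sequence", "canvases": canvases}],
    }


# Example record with two pages (dimensions in pixels are made up).
record = {"identifier": "coll1/report.pdf", "title": "Annual Report",
          "pages": [(1275, 1650), (1275, 1650)]}
print(json.dumps(build_manifest(record), indent=2))
```

The trade-off in this sketch is a small cost per request in exchange for never having to regenerate millions of stored manifest files when metadata or URL patterns change.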

Q & A

  • What is OCLC and what services does it provide?

    -OCLC is a nonprofit global library cooperative that provides shared technology services, original research relevant to libraries and archives, and community programs.

  • What is CONTENTdm and how is it used globally?

    -CONTENTdm is a digital collection management service with over 1,400 separate instances worldwide, served from four data centers in the USA, Canada, Australia, and the Netherlands.

  • Why did OCLC decide to broaden its support to include PDFs in addition to images?

    -OCLC wanted to expand its capabilities beyond just image records to support PDFs due to inherent limitations in their existing image server and to comply with IIIF (International Image Interoperability Framework) standards.

  • What is Cantaloupe and why was it chosen by OCLC?

    -Cantaloupe is an image server that handles the generation of image tiles and zoom levels from high-resolution originals. It was chosen by OCLC because it is implemented in Java, works well in a multi-tenant environment, and has robust support for JPEG 2000, the format used for storing all image derivatives.

  • What is the significance of compliance with IIIF standards for OCLC's updated functionality?

    -Compliance with IIIF standards is a must-have for OCLC's updated functionality to ensure that their services are interoperable with a wide range of viewers and systems that support IIIF.

  • What challenges did OCLC face when implementing Cantaloupe for PDF support?

    -OCLC ran into performance problems and memory management issues with the open-source JPEG 2000 (OpenJPEG) implementation in Cantaloupe, which were resolved by switching to Kakadu, a licensed JPEG 2000 library.

  • How did OCLC address the issue of updating and maintaining millions of IIIF presentation manifests?

    -OCLC switched from static generation to dynamic generation of IIIF presentation manifests to address issues with updating and maintaining the manifests.

  • What is Apache PDFBox and how does it relate to the Cantaloupe implementation?

    -Apache PDFBox is an open-source library written in Java for working with PDF documents. Cantaloupe uses it to convert PDF pages from vector format into raster images for display.

  • What was the solution to the issue of Cantaloupe's initial flawed implementation for accessing PDF pages?

    -The solution was to append the page number to the identifier in the IIIF URL, separated by a semicolon (a URI sub-delimiter), which aligns with IIIF standards and works in viewers like Mirador.

  • How did OCLC handle the challenge of diverse and large PDF files in their system?

    -OCLC worked with the Cantaloupe team to enable an Apache PDFBox feature that uses file storage instead of memory while rendering, which helps manage the rendering of large PDFs more efficiently.

  • What is the current status of the Cantaloupe implementation project at OCLC?

    -OCLC is about 60% of the way through installing Cantaloupe on all CONTENTdm sites, with data preparation and some file conversions still in progress.

  • What future improvements does OCLC plan to make after completing the Cantaloupe installation?

    -OCLC is considering support for IIIF version 3, search term highlighting in IIIF viewers, and options for handling extremely large PDFs, such as pre-rendered static images.

  • How does OCLC ensure user feedback drives the development and improvement of their services?

    -OCLC listens to user feedback as they roll out and publicize new features like Cantaloupe and the availability of PDFs through the IIIF image API, using this feedback to guide future enhancements.
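
The semicolon-delimited page number described in the Q&A above can be sketched as a small URL builder. This is a minimal illustration, assuming a IIIF Image API 2.x endpoint and 1-based page numbers; the base URL, the function name `iiif_image_url`, and its defaults are hypothetical and not taken from the presentation.

```python
from urllib.parse import quote


def iiif_image_url(base, identifier, page=None, region="full", size="full",
                   rotation="0", quality="default", fmt="jpg"):
    """Build a IIIF Image API request URL, optionally for one PDF page."""
    # Percent-encode the identifier, including any "/" it contains.
    encoded = quote(identifier, safe="")
    if page is not None:
        if page < 1:
            raise ValueError("page numbers are assumed to start at 1")
        # The page rides along as part of the identifier segment, after ";".
        encoded = f"{encoded};{page}"
    return f"{base.rstrip('/')}/{encoded}/{region}/{size}/{rotation}/{quality}.{fmt}"


# Hypothetical server and identifier: request page 3 of a PDF at full size.
print(iiif_image_url("https://example.org/iiif/2", "coll1/report.pdf", page=3))
```

Because the page number lives inside the identifier path segment and not in a query string, a viewer that only knows the standard `{identifier}/{region}/{size}/{rotation}/{quality}.{format}` pattern can request tiles for a single page without any PDF-specific logic.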


Related Tags
PDF Rendering, Digital Libraries, OCLC, CONTENTdm, Image APIs, IIIF, Cantaloupe, Java, JPEG 2000, Apache PDFBox, Book Reader