WWDC25: Read documents using the Vision framework | Apple

Apple Developer
10 Jun 2025 · 20:22

Summary

TL;DR: In this video, Megan Williams, an engineer on the Vision framework team, introduces new APIs for machine learning in app development. These include RecognizeDocumentsRequest, which helps extract structured information from documents such as tables and lists, and DetectLensSmudgeRequest, designed to identify images taken with a smudged lens. Megan also highlights improvements in hand pose detection with a modernized model. These new features aim to enhance document parsing, image quality control, and hand gesture recognition, offering developers powerful tools to improve their apps' functionality.

Takeaways

  • 😀 Vision Framework offers APIs for various machine learning tasks such as object detection, body and hand pose tracking, and trajectory analysis, all running entirely on-device for optimal performance and security.
  • 😀 Vision is available on multiple Apple platforms, including iOS, macOS, iPadOS, tvOS, and visionOS, and offers 31 different APIs for diverse image analysis tasks.
  • 😀 The new RecognizeDocumentsRequest API allows developers to extract more than just text, also recognizing document structure such as tables, lists, and paragraphs, and supporting 26 languages.
  • 😀 The RecognizeDocumentsRequest API is ideal for extracting structured data from documents, such as sign-up sheets, by recognizing rows, columns, and key information like email addresses or phone numbers.
  • 😀 Developers can use the RecognizeDocumentsRequest API to automate the extraction of structured elements from documents, improving parsing and reducing code complexity.
  • 😀 RecognizeDocumentsRequest also surfaces important data in documents, such as phone numbers, email addresses, URLs, and dates, via the DataDetection framework.
  • 😀 The DetectLensSmudgeRequest API helps ensure high-quality images by identifying whether a camera lens is smudged, providing a confidence score to assess the image quality.
  • 😀 With the DetectLensSmudgeRequest, developers can set a threshold confidence score to filter out low-quality images caused by lens smudging.
  • 😀 Other Vision APIs like DetectFaceCaptureQualityRequest and CalculateImageAestheticsScoresRequest can be used alongside DetectLensSmudgeRequest to assess overall image quality and aesthetic value.
  • 😀 Vision now includes an updated hand pose detection model, providing better accuracy and efficiency in detecting hand joints, which can be used for gesture control in apps.
  • 😀 Developers with pre-existing hand pose classifiers should retrain their models using the new, smaller, and more accurate hand pose detection model to improve performance.

Q & A

  • What is the Vision Framework, and what capabilities does it offer?

    -The Vision Framework provides APIs to bring machine learning to apps for tasks such as person and object detection, body and hand pose tracking, and trajectory analysis. It runs entirely on-device, offering secure and high-performance computer vision capabilities across multiple Apple platforms including iOS, macOS, iPadOS, tvOS, and visionOS.
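
     As a rough illustration, the newer Swift Vision API follows a simple request/perform pattern; the sketch below uses the existing RecognizeTextRequest and assumes perform(on:) accepts a file URL, as shown in earlier sessions. The function name and URL are placeholders.

     import Foundation
     import Vision

     // Create a request value and perform it directly on an image.
     // All processing happens on-device.
     func recognizeText(in imageURL: URL) async throws -> [String] {
         let request = RecognizeTextRequest()
         let observations = try await request.perform(on: imageURL)
         // Each observation carries ranked candidate strings; keep the top one.
         return observations.compactMap { $0.topCandidates(1).first?.string }
     }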

  • What are the new APIs introduced in this video?

    -This video introduces two new APIs: RecognizeDocumentsRequest for structured document understanding, and DetectLensSmudgeRequest for identifying photos taken with a smudged camera lens.

  • What problem does the RecognizeDocumentsRequest API solve?

    -RecognizeDocumentsRequest solves the problem of extracting structured information from documents. It can detect elements like tables, lists, and paragraphs, and can also identify important information such as phone numbers, email addresses, and URLs, making it easier for developers to parse documents with fewer lines of code.
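
     A hedged sketch of what that might look like, assuming RecognizeDocumentsRequest follows the same request/perform pattern as other Swift Vision requests and that each observation exposes a document container with tables and lists (these property names come from the session's description and may differ):

     import Foundation
     import Vision

     func summarizeStructure(of imageURL: URL) async throws {
         let request = RecognizeDocumentsRequest()
         let observations = try await request.perform(on: imageURL)
         // Assumed: each DocumentObservation wraps a hierarchical document container.
         guard let document = observations.first?.document else { return }
         print("Tables found:", document.tables.count)
         print("Lists found:", document.lists.count)
     }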

  • How does the RecognizeDocumentsRequest API improve document processing compared to RecognizeTextRequest?

    -While RecognizeTextRequest only extracts text lines, RecognizeDocumentsRequest recognizes the structure of the document, including tables, paragraphs, and other elements, allowing for more accurate parsing and extraction of structured data like rows in a table.

  • Can you explain how the DocumentObservation works in RecognizeDocumentsRequest?

    -DocumentObservation is the result returned by RecognizeDocumentsRequest. It provides a hierarchical structure of the document, containing text, tables, lists, and barcodes, allowing developers to easily access and extract structured data like rows and columns from tables.
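
     A hedged sketch of walking that hierarchy for a sign-up-sheet style table; the property names used here (document, tables, rows, content, text, transcript) are assumptions based on the session's description:

     import Foundation
     import Vision

     func printTableRows(in imageURL: URL) async throws {
         let request = RecognizeDocumentsRequest()
         guard let document = try await request.perform(on: imageURL).first?.document,
               let table = document.tables.first else { return }

         // Walk the first detected table row by row and print each cell's transcript.
         for row in table.rows {
             let cells = row.map { $0.content.text.transcript }
             print(cells.joined(separator: " | "))
         }
         // The session also describes detected data (emails, phone numbers, URLs, dates)
         // surfaced on the text content via the DataDetection framework; its exact shape
         // is not shown in this summary, so it is omitted here.
     }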

  • How does the DetectLensSmudgeRequest help improve image quality in apps?

    -DetectLensSmudgeRequest detects if an image is taken with a smudged camera lens, which can degrade image quality. By using this API, developers can prompt users to clean their lens or take a different photo, ensuring only high-quality images are processed.

  • What kind of data does the DetectLensSmudgeRequest provide?

    -The DetectLensSmudgeRequest produces a smudge observation with a confidence score between 0 and 1. The score indicates the probability that the image is smudged, with higher scores suggesting a higher likelihood of smudging.
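
     A hedged sketch, assuming the request returns a single smudge observation whose confidence is the 0-to-1 probability described above:

     import Foundation
     import Vision

     func smudgeProbability(of imageURL: URL) async throws -> Float {
         let request = DetectLensSmudgeRequest()
         let observation = try await request.perform(on: imageURL)
         // Closer to 0: lens likely clean. Closer to 1: lens likely smudged.
         return observation.confidence
     }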

  • How can developers handle low-quality images detected by DetectLensSmudgeRequest?

    -Developers can filter out poor-quality images by comparing the smudge observation's confidence score to a predefined threshold. By adjusting this threshold, developers can control how strict the filter is for rejecting smudged images.
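
     A hedged sketch of that filtering step; the 0.7 threshold is an arbitrary example value that an app would tune to control strictness:

     import Foundation
     import Vision

     func removeSmudgedImages(from imageURLs: [URL],
                              threshold: Float = 0.7) async throws -> [URL] {
         let request = DetectLensSmudgeRequest()
         var keep: [URL] = []
         for url in imageURLs {
             let observation = try await request.perform(on: url)
             // Discard images whose smudge confidence meets or exceeds the threshold.
             if observation.confidence < threshold {
                 keep.append(url)
             }
         }
         return keep
     }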

  • What other APIs can be used in combination with DetectLensSmudgeRequest to evaluate image quality?

    -Developers can use DetectFaceCaptureQualityRequest to assess the quality of face captures, or CalculateImageAestheticsScoresRequest to evaluate the overall aesthetic quality of an image, particularly when no faces are detected.
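
     A hedged sketch chaining the three checks; the observation properties used here (confidence, captureQuality, overallScore) are assumptions modeled on the corresponding earlier Vision APIs, and 0.7 is an arbitrary example threshold:

     import Foundation
     import Vision

     func assessImage(at imageURL: URL) async throws {
         // 1. Reject smudged shots first (0...1, higher = more likely smudged).
         let smudge = try await DetectLensSmudgeRequest().perform(on: imageURL)
         guard smudge.confidence < 0.7 else {
             print("Lens looks smudged; ask the user to retake the photo.")
             return
         }
         // 2. If a face is present, score the face capture quality.
         let faces = try await DetectFaceCaptureQualityRequest().perform(on: imageURL)
         if let quality = faces.first?.captureQuality?.score {
             print("Face capture quality:", quality)
         } else {
             // 3. No face detected: fall back to an overall aesthetics score.
             let aesthetics = try await CalculateImageAestheticsScoresRequest().perform(on: imageURL)
             print("Overall aesthetics score:", aesthetics.overallScore)
         }
     }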

  • What is the significance of the updated hand pose detection model in the Vision Framework?

    -The updated hand pose detection model in Vision provides improved accuracy, reduced memory usage, and lower latency. It still detects 21 joints in the hand but requires retraining any existing hand pose or action classifiers to align with the new model's joint locations.
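
     Hand pose detection is exposed through the long-standing VNDetectHumanHandPoseRequest API, which the updated model presumably sits behind; a minimal sketch of reading one of the 21 detected joints (coordinates are normalized to the image):

     import CoreGraphics
     import Vision

     func indexFingertip(in cgImage: CGImage) throws -> CGPoint? {
         let request = VNDetectHumanHandPoseRequest()
         request.maximumHandCount = 1

         let handler = VNImageRequestHandler(cgImage: cgImage, options: [:])
         try handler.perform([request])

         guard let observation = request.results?.first else { return nil }
         // 21 joints per hand; points use normalized image coordinates.
         let joints = try observation.recognizedPoints(.all)
         guard let tip = joints[.indexTip], tip.confidence > 0.5 else { return nil }
         return tip.location
     }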

Related Tags
Vision Framework, API Updates, Document Analysis, Lens Smudge, Hand Pose, Machine Learning, iOS Development, Tech Updates, App Development, WWDC 2025, Object Detection