Ingesting PDFs into Weaviate with Unstructured & Donut
AI Impact Summary
This demo walks through ingesting PDF documents into Weaviate with Unstructured, which uses OCR and multimodal document-understanding models such as LayoutLMv3 and Donut to extract text and visual layout information. The core workflow partitions each PDF into elements, cleans the extracted text, and stages it for Weaviate. The staged content is then vectorized through Weaviate’s text2vec-openai module, which calls OpenAI’s embedding API, and stored in Weaviate. The result supports semantic search and retrieval over PDF content, the basis for applications like ChatPDF or ChatDOC.
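The clean-and-stage part of that workflow can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the demo's actual code: the class name `PdfChunk`, the property names, and the helper functions are all hypothetical, and plain dicts stand in for the elements a PDF partitioner would return. The only Weaviate-specific detail relied on is the class definition format, where `"vectorizer": "text2vec-openai"` asks Weaviate to embed objects through OpenAI at import time.

```python
import re
from typing import Dict, List

# Hypothetical class name, chosen for illustration only.
CLASS_NAME = "PdfChunk"


def build_class_schema(class_name: str = CLASS_NAME) -> Dict:
    """Build a Weaviate class definition that delegates embedding to text2vec-openai."""
    return {
        "class": class_name,
        "vectorizer": "text2vec-openai",
        # Property names are assumptions for this sketch.
        "properties": [
            {"name": "text", "dataType": ["text"]},
            {"name": "category", "dataType": ["text"]},
            {"name": "page_number", "dataType": ["int"]},
        ],
    }


def clean_text(raw: str) -> str:
    """Rejoin words hyphenated across line breaks, then collapse whitespace."""
    joined = re.sub(r"(\w)-\s*\n\s*(\w)", r"\1\2", raw)
    return re.sub(r"\s+", " ", joined).strip()


def stage_elements(elements: List[Dict]) -> List[Dict]:
    """Convert partitioned elements into property dicts, skipping empty text."""
    staged = []
    for element in elements:
        text = clean_text(element.get("text", ""))
        if not text:
            continue  # Nothing worth embedding.
        staged.append(
            {
                "text": text,
                "category": element.get("category", "Uncategorized"),
                "page_number": element.get("page_number", 0),
            }
        )
    return staged


if __name__ == "__main__":
    # Stand-in for partitioner output; a real run would partition a PDF first.
    sample_elements = [
        {"text": "Ingesting PDFs", "category": "Title", "page_number": 1},
        {"text": "seman-\ntic   search over\nPDF content", "category": "NarrativeText", "page_number": 1},
        {"text": "   ", "category": "NarrativeText", "page_number": 2},
    ]
    print(build_class_schema())
    for obj in stage_elements(sample_elements):
        print(obj)
```

In a real pipeline the schema dict would be registered with a Weaviate instance and the staged objects uploaded in batches. Those client calls are left out here because they depend on the client version in use.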
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info