
Ingesting PDFs into Weaviate with Unstructured & Donut

AI Impact Summary

This demo walks through ingesting PDF documents into Weaviate with Unstructured, which uses OCR and multimodal document-understanding models such as LayoutLMv3 and Donut to extract text and layout information. The core workflow partitions the PDF into elements, cleans them, and stages them for upload; Weaviate then vectorizes the content with its text2vec-openai module, which calls OpenAI's embedding models, and stores it. The result supports semantic search and retrieval over PDF content, the basis for applications such as ChatPDF or ChatDOC.
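The partition, clean, and stage steps described above can be sketched roughly as follows. This is a minimal illustration, not the demo's actual code: the `Element` stand-in, the helper names, the `PdfChunk` class name, and the property names are all assumptions made for this example. Only `partition_with_unstructured` touches the `unstructured` package (assumed to be installed with its PDF extras); the upload itself is left out because it depends on which Weaviate client version is in use.

```python
import re
from dataclasses import dataclass


@dataclass
class Element:
    """Hypothetical stand-in for one partitioned PDF element."""
    category: str      # e.g. "Title" or "NarrativeText"
    text: str
    page_number: int


# Assumed Weaviate class definition; the class and property names are
# illustrative. "text2vec-openai" is the vectorizer module named in the summary.
SCHEMA = {
    "class": "PdfChunk",
    "vectorizer": "text2vec-openai",
    "properties": [
        {"name": "text", "dataType": ["text"]},
        {"name": "category", "dataType": ["text"]},
        {"name": "page_number", "dataType": ["int"]},
        {"name": "source", "dataType": ["text"]},
    ],
}


def clean_text(text: str) -> str:
    """Cleaning step: collapse whitespace runs and trim the ends."""
    return re.sub(r"\s+", " ", text).strip()


def stage_elements(elements, source):
    """Staging step: build one flat property dict per non-empty element."""
    staged = []
    for el in elements:
        text = clean_text(el.text)
        if not text:
            continue  # skip elements that are empty after cleaning
        staged.append({
            "text": text,
            "category": el.category,
            "page_number": el.page_number,
            "source": source,
        })
    return staged


def partition_with_unstructured(path):
    """Partitioning step; assumes `unstructured` with PDF support is installed."""
    from unstructured.partition.pdf import partition_pdf  # lazy import

    return [
        Element(
            category=el.category,
            text=el.text or "",
            page_number=el.metadata.page_number or 0,
        )
        for el in partition_pdf(filename=path)
    ]
```

Each dict returned by `stage_elements` matches the properties in `SCHEMA`, so the staged list could be handed to a Weaviate batch import, with vectorization happening on the Weaviate side at import time.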

Affected Systems

Unstructured, Weaviate

Date: Date not specified
Change type: capability
Severity: info

