Connect PDF Files Using the Python Connector in Bold Data Hub
Adding PDF File Type Support in Filesystem Connector
Overview
This article outlines the steps for connecting and utilizing PDF files within Bold Data Hub using a Python connector.
Steps to Connect PDF Files in Bold Data Hub
Data Extraction Workflow
To efficiently extract data from PDF files, the following inputs are necessary:
Required Inputs
- Delimiter: A character or string used to divide the data into rows. For example, a colon (:) can be used to separate different pieces of information within the PDF.
- Schema (Optional): A predefined structure that outlines how the data should be divided into separate columns. If no schema is provided, all of the data is placed in a single column.
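To illustrate how the two inputs interact, here is a minimal Python sketch. It assumes the PDF text has already been extracted into a string and that a schema is simply an ordered list of column names. The function name `split_text`, the fallback column name `value`, and the pipe-delimited sample text are hypothetical and are not part of Bold Data Hub.

```python
def split_text(text, delimiter, schema=None):
    """Split extracted text on the delimiter and arrange the pieces as rows.

    Assumption: a schema is an ordered list of column names. Without one,
    every piece becomes its own row in a single column named "value".
    """
    pieces = [p.strip() for p in text.split(delimiter) if p.strip()]
    if not schema:
        return [{"value": p} for p in pieces]
    width = len(schema)
    # Group consecutive pieces into rows of len(schema) columns each.
    return [dict(zip(schema, pieces[i:i + width]))
            for i in range(0, len(pieces), width)]


# Hypothetical sample text, used only for illustration.
sample = "12345|2023-10-01|250.00|12346|2023-10-02|99.50"
single_column = split_text(sample, "|")
three_columns = split_text(sample, "|", ["InvoiceID", "Date", "Amount"])
```

Without a schema, the six pieces land in one column; with the three-name schema, the same pieces form two rows of three columns each.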
Example
If invoice data is extracted using a colon as the delimiter, the data might appear as follows:
InvoiceID: 12345
Date: 2023-10-01
Amount: 250.00
In this example, the delimiter (:) separates the different fields, while the schema can define how these fields are organized into columns.
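The invoice example can be reproduced with a short sketch. It assumes each field sits on its own line and that the first colon on a line separates the label from the value. `parse_fields` and `to_row` are hypothetical helper names, not functions provided by Bold Data Hub.

```python
def parse_fields(text, delimiter=":"):
    """Return a dict mapping each label to its value, one field per line."""
    fields = {}
    for line in text.splitlines():
        if delimiter not in line:
            continue  # skip lines that carry no label/value pair
        # Split only on the first delimiter so values may contain it too.
        label, value = line.split(delimiter, 1)
        fields[label.strip()] = value.strip()
    return fields


def to_row(fields, schema):
    """Order the parsed values into columns as listed in the schema."""
    return [fields.get(column) for column in schema]


invoice_text = "InvoiceID: 12345\nDate: 2023-10-01\nAmount: 250.00"
fields = parse_fields(invoice_text)
row = to_row(fields, ["InvoiceID", "Date", "Amount"])
```

Here the schema only decides the column order; a column named in the schema but missing from the text would come back as `None`.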
The following custom script is used to transfer data from the PDF file. Modify the delimiter and the schema in the file below to match your data.
- Create a New Pipeline
  Begin by creating a new pipeline in the Data Hub interface.
- Choose PythonScript as Connector
  Select PythonScript as your connector and click the “Add Template” button.
- Upload the File
  Use the “Upload File” button to upload your file.
- Select and Upload the File
  Click the “Choose File” button to select the file from your local system, then click the “Upload” button.
- Copy the Filepath
  After uploading, use the “Copy” button to copy the filepath of the uploaded file.
- Save and Schedule the Project
  Save your project and set up a schedule for it.
- Check the Logs
  Navigate to the logs tab to check for any updates or errors.
- Use the Data Source in Bold BI
  The data source created using Data Hub can now be utilized in Bold BI.
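If the logs report a failure, one thing worth ruling out is a copied filepath that does not point at a readable PDF. The sketch below is a minimal pre-check, assuming the filepath is accessible from wherever the script runs. `looks_like_pdf` is a hypothetical helper, not part of Bold Data Hub; it relies only on the fact that PDF files begin with the bytes `%PDF-`.

```python
from pathlib import Path


def looks_like_pdf(filepath):
    """Return True if the path is an existing .pdf file with a PDF header."""
    path = Path(filepath)
    if not path.is_file() or path.suffix.lower() != ".pdf":
        return False
    # A PDF file starts with the marker "%PDF-" followed by its version.
    with path.open("rb") as handle:
        return handle.read(5) == b"%PDF-"
```

Calling `looks_like_pdf` with the copied filepath before the extraction logic runs gives an early, readable signal instead of a parsing error later in the pipeline.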