Connect PDF Files Using the Python Connector in Bold Data Hub

This article outlines the steps for connecting and utilizing PDF files within Bold Data Hub using a Python connector.

To extract data from PDF files, the following inputs are used:

Delimiter: A character or string used to divide the data into rows. For example, a colon (:) can be used to separate different pieces of information within the PDF.
Schema (Optional): A predefined structure that outlines how the data should be divided into separate columns. If no schema is provided, the data is placed in a single column.

For instance, if invoice data is extracted using a colon as a delimiter, the data might appear as follows:

InvoiceID: 12345
Date: 2023-10-01
Amount: 250.00

In this example, the delimiter (:) separates each field name from its value, while the schema can define how the resulting pieces are organized into columns.
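
As an illustration only, the role of the two inputs can be sketched in a few lines of Python. This is not the script used by the connector. It assumes the text has already been extracted from the PDF, and the function name `parse_lines`, the column names `field` and `value`, and the line-by-line splitting are all hypothetical choices made for this example.

```python
from typing import Optional


def parse_lines(
    text: str,
    delimiter: str = ":",
    schema: Optional[list[str]] = None,
) -> list[dict[str, str]]:
    """Turn extracted PDF text into rows (hypothetical helper)."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if not schema:
            # No schema: keep the whole line in a single column.
            rows.append({"value": line})
            continue
        # Split into at most len(schema) pieces so extra delimiters
        # (for example in a time such as 10:30) stay in the last column.
        parts = [p.strip() for p in line.split(delimiter, len(schema) - 1)]
        # Pad with empty strings when a line has fewer pieces than columns.
        parts += [""] * (len(schema) - len(parts))
        rows.append(dict(zip(schema, parts)))
    return rows


sample = "InvoiceID: 12345\nDate: 2023-10-01\nAmount: 250.00"

with_schema = parse_lines(sample, ":", ["field", "value"])
without_schema = parse_lines(sample)
```

With the two-column schema, each invoice line becomes one row with a `field` and a `value` entry. Without a schema, each line is kept whole in a single column, which mirrors the fallback described above.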

A custom Python script is used to transfer data from the PDF. Modify the delimiter and the schema in the script to match your PDF before uploading it.

Create a New Pipeline

Begin by creating a new pipeline in the Data Hub interface.

Choose PythonScript as Connector

Select PythonScript as your connector and click the “Add Template” button.

Upload the Python File

Use the “Upload File” button to upload your Python file.

Select and Upload the File

Click the “Choose File” button to select the file from your local system, then click the “Upload” button.

Copy the Filepath

After uploading, use the “Copy” button to copy the filepath of the uploaded Python file.

Save and Schedule the Project

Save your project and set up a schedule for it.

Check the Logs

Navigate to the Logs tab to check for any updates or errors.

Use the Data Source in Bold BI

The data source created using Data Hub can now be used in Bold BI.