This project aims to automate the processing of invoices using PDF extraction and data manipulation techniques.
After setting up all the files, the project structure will look like this:
project/
├── docs/
│   ├── setup.md
│   ├── data_processing.md
│   ├── file_extraction.md
│   ├── zip_data_processing.md
│   ├── pdf_operations.md
│   └── exception.md
├── InvoicesData/
│   └── TestDataSet/
│       └── Pdfs
├── src/
│   ├── logging_utils.py
│   ├── pdf_operations.py
│   ├── file_extraction.py
│   ├── fail_file_extraction.py
│   ├── exception_handler.py
│   ├── zip_data_processing.py
│   ├── data_processing.py
│   └── client_config.json
├── output/
│   ├── failed/
│   │   └── output81.json (generated by fail_file_extraction.py)
│   ├── LogFile.log (generated by file_extraction.py)
│   ├── failed_files.txt (generated by file_extraction.py)
│   ├── invoice.json (generated by file_extraction.py)
│   └── exception.json (generated by exception_handler.py)
├── pdfservices-api-credentials.json (set up as described in setup.md)
├── private.key (set up as described in setup.md)
├── output.csv (generated by data_processing.py)
└── README.md
- Set up the project as described in setup.md.
- If you have any doubts about the source code, refer to the docs.
- For any further doubts, post an issue on GitHub.
- Place the PDF files to be processed in the source folder specified in file_extraction.py.
- Run file_extraction.py to start processing the PDF files.
- The script extracts the relevant data from the PDF files, updates the master data, and saves it in invoice.json.
- If any files fail to process initially, the script retries them up to the number of times specified by MAX_RETRY_LIMIT in file_extraction.py.
- If a problem occurs on the user's end, such as an exhausted API quota or network issues, the affected files are written to failed_files.txt. After resolving the problem, you can run fail_file_extraction.py directly to process the remaining files.
- If the maximum retry limit is reached and there are still failed files, the script saves the list of failed files in failed_files.txt and saves the JSON data in the failed folder.
- Run exception_handler.py to process the failed files in the failed folder separately and generate the data in exception.json.
- Finally, run data_processing.py, specifying the paths to invoice.json and exception.json; the output is written to output.csv.
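The retry behaviour described in the steps above can be pictured with a minimal sketch. It is not the code in file_extraction.py: the function name process_with_retries, the extract callback, and the value 3 for MAX_RETRY_LIMIT are assumptions made for illustration. It assumes one initial attempt followed by at most MAX_RETRY_LIMIT retry rounds, and it writes whatever still fails to the failed-files list.

```python
from pathlib import Path
from typing import Callable, Dict, Iterable, List

MAX_RETRY_LIMIT = 3  # assumed value; the real limit is defined in file_extraction.py


def process_with_retries(
    pdf_paths: Iterable[Path],
    extract: Callable[[Path], Dict],
    failed_list_path: Path,
    max_retries: int = MAX_RETRY_LIMIT,
) -> List[Dict]:
    """Call `extract` on each PDF; retry the failures for up to `max_retries` extra rounds.

    `extract` is a hypothetical stand-in for the real extraction call and is
    expected to raise an exception when a file cannot be processed.
    """
    results: List[Dict] = []
    pending = list(pdf_paths)
    # One initial round plus up to `max_retries` retry rounds.
    for _round in range(1 + max_retries):
        still_failing = []
        for path in pending:
            try:
                results.append(extract(path))
            except Exception:
                still_failing.append(path)
        pending = still_failing
        if not pending:
            break
    # Record the files that never succeeded, one path per line.
    failed_list_path.write_text("".join(f"{p}\n" for p in pending))
    return results
```

Keeping the extraction call behind a callback means the retry loop can be exercised with a fake extractor, without spending API quota.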
Note: Make sure to set up the necessary credentials and configurations for the Adobe PDF Services API as described in the project documentation.
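Before running the scripts, it can help to confirm that the credential files are in place. The following is a minimal sketch, assuming both files sit in the project root as shown in the project structure; the helper name missing_credentials is hypothetical and not part of the project.

```python
from pathlib import Path
from typing import List

# File names taken from the project structure; their location in the
# project root is an assumption based on that listing.
REQUIRED_FILES = ("pdfservices-api-credentials.json", "private.key")


def missing_credentials(project_root: Path) -> List[str]:
    """Return the names of required credential files not found in `project_root`."""
    return [name for name in REQUIRED_FILES if not (project_root / name).is_file()]


if __name__ == "__main__":
    missing = missing_credentials(Path("."))
    if missing:
        print("Missing credential files (see setup.md):", ", ".join(missing))
    else:
        print("All credential files found.")
```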
The project relies on the following dependencies: Python 3.x, adobe-pdfservices-sdk, logging, json, tempfile, csv, re, zipfile.
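Of these, logging, json, tempfile, csv, re, and zipfile are part of the Python standard library and need no installation. Install the Adobe PDF Services SDK using pip or another appropriate package manager.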
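A quick way to check that the listed modules are importable in the current environment is sketched below. The helper name unavailable is hypothetical. The import name of the Adobe SDK is deliberately left out because it is not stated here; pass it in yourself once you know it.

```python
import importlib.util
from typing import Iterable, List

# Standard-library modules named in the dependency list.
STDLIB_MODULES = ["logging", "json", "tempfile", "csv", "re", "zipfile"]


def unavailable(modules: Iterable[str]) -> List[str]:
    """Return the top-level module names that cannot be found for import."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = unavailable(STDLIB_MODULES)
    print("Missing modules:", missing if missing else "none")
```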