This software is experimental and provided as-is, without any express or implied warranties. Use it at your own risk. The authors and contributors assume no responsibility for any issues, damages, or losses that may arise from its use. By using this software, you agree that no liability will be held against the developers under any circumstances. Always test thoroughly before deploying in production environments.
This Python Knowledge Extraction project allows you to extract and process text from PDF documents using a configuration file for text cleanup and organization. The output is saved as structured JSON files.
- Clone the repository to your local machine.
- Navigate to the project directory.
- Create a Python virtual environment:
python -m venv env
- Activate the virtual environment:
- On Windows:
.\env\Scripts\activate
- On macOS/Linux:
source env/bin/activate
- Install the required dependencies:
pip install -r requirements.txt
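To confirm that the activation step took effect before installing anything, a short check like the one below can help. It is a minimal sketch that assumes the environment was created with `python -m venv` as described above; the function name `in_virtualenv` is illustrative and not part of this project.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the interpreter is running inside a venv."""
    # Inside a venv, sys.prefix points at the environment directory while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


if __name__ == "__main__":
    print("Python version:", sys.version.split()[0])
    print("Virtual environment active:", in_virtualenv())
```

If the second line prints `False`, the environment is probably not activated in the current shell.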
- Store PDF Documents: Place your PDF documents in a designated folder, e.g., data/documents.
- Create a Configuration File: Prepare a JSON configuration file for text cleanup and processing. Store it in the appropriate directory, e.g., utils/text_cleanup_config.json. Example structure of the configuration file:
{
  "special_characters": [ "•", "▪", "●", "■", "◆", "◦", "*", "-", "–", "—", "→", "" ],
  "expressions": [ "\\n", "\\t", "\\r", "(next page)" ],
  "texts_to_remove": [
    "Inferring job vacancies from online job advertisements",
    "Distributional national account estimates for household income and consumption: methodological issues…",
    "An introduction to Large Language Models and their relevance for statistical offices"
  ]
}
- Choose an Output Folder: Specify an output folder where the processed JSON files will be saved. If the folder does not exist, it will be created automatically.
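To make the role of the three configuration lists more concrete, here is a minimal sketch of how such a configuration could be applied to a string. It does not reproduce the project's actual cleanup code: the function `clean_text`, the processing order, and the sample configuration are assumptions made for illustration. In particular, it assumes that each `expressions` entry is a regular expression; under that assumption an entry such as "(next page)" would need its parentheses escaped to match them literally.

```python
import re


def clean_text(text: str, config: dict) -> str:
    """Illustrative cleanup pass driven by the three config lists."""
    # Drop whole phrases first (e.g. running headers), before the later
    # steps alter characters inside them.
    for phrase in config.get("texts_to_remove", []):
        text = text.replace(phrase, "")
    # Assumption: each "expressions" entry is a regular expression.
    for pattern in config.get("expressions", []):
        text = re.sub(pattern, " ", text)
    # Strip bullet-like characters; empty strings are skipped.
    for char in config.get("special_characters", []):
        if char:
            text = text.replace(char, "")
    # Collapse the whitespace left behind by the removals.
    return re.sub(r"\s+", " ", text).strip()


# Hypothetical, shortened configuration used only for this demonstration.
sample_config = {
    "special_characters": ["•", "*"],
    "expressions": ["\\n", "\\t"],
    "texts_to_remove": ["Page header"],
}

print(clean_text("Page header\n• First point\t* Second point", sample_config))
# prints: First point Second point
```

Note that a character such as "-" in `special_characters` would also be removed from hyphenated words under this simple approach, so the list is worth reviewing against your own documents.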
Run the script from the command line using the following command format:
python extract_pdfs.py --folder_path <path_to_pdf_folder> --output_folder <path_to_output_folder> --config_path <path_to_config_file>
For example:
python extract_pdfs.py --folder_path data/documents --output_folder data/outputs --config_path utils/text_cleanup_config.json
This will:
- Process all PDF files in the data/documents folder.
- Use the configuration file located at utils/text_cleanup_config.json.
- Save the output JSON files to the data/outputs folder.
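The three flags used above could be declared with `argparse` roughly as follows. This is only a sketch of one plausible way to parse them; the contents of `extract_pdfs.py` are not shown in this README, so the function name `build_parser`, the help texts, and the choice to make every flag required are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare the three command-line flags used in the example command."""
    parser = argparse.ArgumentParser(description="Extract text from PDFs into JSON.")
    # Assumption: all three flags are mandatory.
    parser.add_argument("--folder_path", required=True, help="Folder containing the PDF files")
    parser.add_argument("--output_folder", required=True, help="Folder for the JSON output")
    parser.add_argument("--config_path", required=True, help="Path to the cleanup config JSON")
    return parser


# Parse the example command's arguments without touching the file system.
args = build_parser().parse_args([
    "--folder_path", "data/documents",
    "--output_folder", "data/outputs",
    "--config_path", "utils/text_cleanup_config.json",
])
print(args.folder_path, args.output_folder, args.config_path)
```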
The script generates structured JSON files for each PDF document in the specified output folder. Each JSON file contains:
1. Metadata about the PDF
2. Leveled text: hierarchical structuring of text sections
3. Processed text: text sections only
4. Cleaned text: the full plain text after cleanup
Note: if 2. or 3. fails, only 1. and 4. will be included in the output.
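Because some parts may be missing from a given file, it can be useful to check which top-level entries each output actually contains. The sketch below lists them without assuming any particular key names, since the exact JSON keys are not documented here; `summarize_outputs` is a hypothetical helper, not part of the project.

```python
import json
from pathlib import Path


def summarize_outputs(output_folder: str) -> dict:
    """Map each JSON file name to the sorted top-level keys it contains."""
    summary = {}
    for path in sorted(Path(output_folder).glob("*.json")):
        with path.open(encoding="utf-8") as handle:
            data = json.load(handle)
        # Only object-shaped outputs have keys to report.
        summary[path.name] = sorted(data) if isinstance(data, dict) else []
    return summary


if __name__ == "__main__":
    for name, keys in summarize_outputs("data/outputs").items():
        print(name, keys)
```

Files that list fewer entries than the others are likely the ones where the structured extraction did not succeed.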
- Ensure that all necessary dependencies are installed via requirements.txt.
- Verify the correctness of the configuration file to achieve the desired processing and cleanup.
- Environment Issues: If the virtual environment is not working, ensure you are using the correct Python version.
- Missing Folders: Ensure the input folder with PDFs exists. The output folder will be created automatically.
- Errors in JSON: Verify the syntax and structure of the configuration file.
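For the JSON check in particular, a small validator can point at the exact location of a syntax error. The following is a minimal sketch: the key names come from the example configuration in this README, but whether the script requires all three keys is an assumption, and `validate_config` is a hypothetical helper.

```python
import json

# Key names taken from the example configuration; treating all three as
# required is an assumption.
EXPECTED_KEYS = ("special_characters", "expressions", "texts_to_remove")


def validate_config(raw: str) -> list:
    """Return a list of problems found in the config text; empty means OK."""
    try:
        config = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"Invalid JSON at line {exc.lineno}, column {exc.colno}: {exc.msg}"]
    if not isinstance(config, dict):
        return ["Top-level value must be a JSON object"]
    problems = []
    for key in EXPECTED_KEYS:
        value = config.get(key)
        # Each entry is expected to be a list containing only strings.
        if not isinstance(value, list) or not all(isinstance(v, str) for v in value):
            problems.append(f"'{key}' should be a list of strings")
    return problems


# A well-formed config yields no problems; a trailing comma is reported.
print(validate_config('{"special_characters": ["•"], "expressions": [], "texts_to_remove": []}'))
print(validate_config('{"special_characters": ["•",]}'))
```

To check a real file, read it with `Path("utils/text_cleanup_config.json").read_text(encoding="utf-8")` and pass the result to the function.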