EuroPDF-Extractor README

Disclaimer

This software is experimental and provided as-is, without any express or implied warranties. Use it at your own risk. The authors and contributors assume no responsibility for any issues, damages, or losses that may arise from its use. By using this software, you agree that no liability will be held against the developers under any circumstances. Always test thoroughly before deploying in production environments.

Overview

This Python knowledge-extraction project extracts and processes text from PDF documents, using a JSON configuration file to control text cleanup and organization. The output is saved as structured JSON files, one per PDF.

Getting Started

Step 1: Set Up the Environment

Clone the repository to your local machine.
Navigate to the project directory.
Create a Python virtual environment:
```
python -m venv env
```
Activate the virtual environment:
- On Windows:
```
.\env\Scripts\activate
```
- On macOS/Linux:
```
source env/bin/activate
```
Install the required dependencies:
```
pip install -r requirements.txt
```

Step 2: Organize Your Files

Store PDF Documents: Place your PDF documents in a designated folder, e.g., data/documents.

Create a Configuration File: Prepare a JSON configuration file for text cleanup and processing. Store it in the appropriate directory, e.g., utils/text_cleanup_config.json.

Example structure of the configuration file:

{
   "special_characters": [
      "•", "▪", "●", "■", "◆", "◦", "*", "-", "–", "—", "→", ""
   ],
   "expressions": [
      "\\n", "\\t", "\\r", "(next page)"
   ],
   "texts_to_remove": [
      "Inferring job vacancies from online job advertisements",
      "Distributional national account estimates for household income and consumption: methodological issues…",
      "An introduction to Large Language Models and their relevance for statistical offices"
   ]
}
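
How the script applies these three lists is not documented here, so the following is only a minimal sketch under assumptions: that `texts_to_remove` and `expressions` are removed as literal substrings, that escaped entries such as `\\n` stand for the corresponding control characters, and that `special_characters` are deleted one by one. The helper name `apply_cleanup` is hypothetical and is not part of `extract_pdfs.py`.

```python
import json

# Hypothetical helper, not part of extract_pdfs.py: the real cleanup may differ.
# Maps the two-character escaped entries from the config to control characters.
ESCAPES = {"\\n": "\n", "\\t": "\t", "\\r": "\r"}


def apply_cleanup(text: str, config: dict) -> str:
    """Remove configured phrases, expressions, and special characters."""
    # Assumption: whole phrases (e.g. running headers) are removed first,
    # so that character-level removal cannot break the match.
    for phrase in config.get("texts_to_remove", []):
        text = text.replace(phrase, "")
    # Assumption: expressions are literal substrings, replaced by a space.
    for expr in config.get("expressions", []):
        text = text.replace(ESCAPES.get(expr, expr), " ")
    # Assumption: special characters are deleted; empty entries are skipped.
    for char in config.get("special_characters", []):
        if char:
            text = text.replace(char, "")
    # Collapse the whitespace runs left behind by the removals.
    return " ".join(text.split())


# A shortened configuration in the same shape as the example above.
config = json.loads(r'''
{
   "special_characters": ["•", "*"],
   "expressions": ["\\n", "(next page)"],
   "texts_to_remove": ["Inferring job vacancies from online job advertisements"]
}
''')

sample = (
    "Inferring job vacancies from online job advertisements\n"
    "• First point (next page)\n"
    "* Second point"
)
cleaned = apply_cleanup(sample, config)
print(cleaned)  # prints: First point Second point
```

Under these assumptions, an entry such as "-" in `special_characters` would also strip hyphens inside words, so it is worth reviewing the list against a sample document.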

Choose an Output Folder: Specify an output folder where the processed JSON files will be saved. If the folder does not exist, it will be created automatically.

Step 3: Run the Script

Run the script from the command line using the following command format:

python extract_pdfs.py --folder_path <path_to_pdf_folder> --output_folder <path_to_output_folder> --config_path <path_to_config_file>

Example Command

python extract_pdfs.py --folder_path data/documents --output_folder data/outputs --config_path utils/text_cleanup_config.json

This will:

- Process all PDF files in the data/documents folder.
- Use the configuration file located at utils/text_cleanup_config.json.
- Save the output JSON files to the data/outputs folder.
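
The flow described above can be pictured as a loop over the input folder. The sketch below is an assumption about that flow, not the actual contents of `extract_pdfs.py`: only the three flag names come from the command format, while the helper names `build_parser` and `plan_outputs`, and the choice to name each JSON file after its PDF, are hypothetical.

```python
from __future__ import annotations

import argparse
import tempfile
from pathlib import Path


def build_parser() -> argparse.ArgumentParser:
    """Parser accepting the three flags shown in the command format."""
    parser = argparse.ArgumentParser(description="Extract text from PDFs to JSON.")
    parser.add_argument("--folder_path", required=True, help="Folder containing PDFs")
    parser.add_argument("--output_folder", required=True, help="Folder for JSON output")
    parser.add_argument("--config_path", required=True, help="Cleanup config (JSON)")
    return parser


def plan_outputs(folder_path: str, output_folder: str) -> list[tuple[Path, Path]]:
    """Pair every PDF in folder_path with a JSON path in output_folder."""
    source = Path(folder_path)
    if not source.is_dir():
        raise FileNotFoundError(f"Input folder not found: {source}")
    target = Path(output_folder)
    # Create the output folder if it does not exist yet.
    target.mkdir(parents=True, exist_ok=True)
    # Assumption: one output file per PDF, named after the PDF's stem.
    return [(pdf, target / f"{pdf.stem}.json") for pdf in sorted(source.glob("*.pdf"))]


# Demonstration in a temporary directory with one empty placeholder PDF.
with tempfile.TemporaryDirectory() as tmp:
    docs = Path(tmp) / "documents"
    docs.mkdir()
    (docs / "report.pdf").touch()
    args = build_parser().parse_args([
        "--folder_path", str(docs),
        "--output_folder", str(Path(tmp) / "outputs"),
        "--config_path", "utils/text_cleanup_config.json",
    ])
    pairs = plan_outputs(args.folder_path, args.output_folder)
    for pdf, out in pairs:
        print(pdf.name, "->", out.name)  # prints: report.pdf -> report.json
```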

Output

The script generates structured JSON files for each PDF document in the specified output folder. Each JSON file contains:

1. Metadata: metadata about the PDF
2. Leveled text: hierarchical structuring of text sections
3. Processed text: text sections only
4. Cleaned text: the full plain text after cleanup

Note: if 2. or 3. fails, only 1. and 4. are produced.
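
Because some parts can be missing when structuring fails, downstream code should probably check which parts are present before using them. The key names in the following sketch (`metadata`, `leveled_text`, `processed_text`, `cleaned_text`) are hypothetical placeholders, since the actual JSON keys are not listed here; replace them with the keys found in a real output file.

```python
import json

# Hypothetical key names: check a real output file for the actual ones.
EXPECTED_KEYS = ("metadata", "leveled_text", "processed_text", "cleaned_text")


def summarize_output(document: dict) -> dict:
    """Report which of the expected parts are present and non-empty."""
    return {key: bool(document.get(key)) for key in EXPECTED_KEYS}


# In-memory example of a document where hierarchical structuring failed,
# so only the metadata and the cleaned text are available.
example = json.loads('{"metadata": {"title": "Sample"}, "cleaned_text": "Some text."}')
status = summarize_output(example)
print(status)
```

For a real file, the same function can be applied to the result of `json.load` on any file in the output folder.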

Notes

- Ensure that all necessary dependencies are installed via requirements.txt.
- Verify the correctness of the configuration file to achieve the desired processing and cleanup.

Troubleshooting

- Environment Issues: If the virtual environment is not working, ensure you are using the correct Python version.
- Missing Folders: Ensure the input folder with PDFs exists. The output folder will be created automatically.
- Errors in JSON: Verify the syntax and structure of the configuration file.
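
A quick way to check the configuration file before running the script is to parse it with Python's standard `json` module, which reports the line and column of the first syntax error. The sketch below also checks that the three keys from the example configuration hold lists of strings; that structural check is an assumption based on the example, not a rule confirmed by the script, and the function name `check_config` is hypothetical.

```python
import json

# Keys taken from the example configuration file.
REQUIRED_KEYS = ("special_characters", "expressions", "texts_to_remove")


def check_config(raw: str) -> list:
    """Return a list of problems found in the configuration text."""
    try:
        config = json.loads(raw)
    except json.JSONDecodeError as err:
        # lineno, colno, and msg are attributes of json.JSONDecodeError.
        return [f"Invalid JSON at line {err.lineno}, column {err.colno}: {err.msg}"]
    if not isinstance(config, dict):
        return ["Top-level value must be a JSON object"]
    problems = []
    for key in REQUIRED_KEYS:
        value = config.get(key)
        # Assumption: each key must hold a list of strings, as in the example.
        if not isinstance(value, list) or not all(isinstance(v, str) for v in value):
            problems.append(f"'{key}' should be a list of strings")
    return problems


good = '{"special_characters": ["•"], "expressions": ["(next page)"], "texts_to_remove": []}'
bad = '{"special_characters": ["•"], "expressions": ["(next page)",]}'  # trailing comma

print(check_config(good))  # an empty list means no problems were found
print(check_config(bad))   # one message describing the syntax error
```

To check a file on disk, pass its contents, for example `check_config(open("utils/text_cleanup_config.json", encoding="utf-8").read())`.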

