These are some scripts I generated quickly so I can navigate the 4 GB of PDF data I collected over four years on the RCMP and CSIS, covering the 2010 Winter Olympic Games and the G8/G20 Summit in Toronto. This is very basic retrieval-augmented generation (RAG) work, and I'm sure I can adapt it to provide insights into other documents.
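For a rough idea of what the retrieval half of RAG looks like, here is a minimal sketch that uses only the Python standard library. It is not the code in these scripts: the function names (`tokenize`, `rank_chunks`, `build_prompt`), the TF-IDF-style scoring, and the prompt wording are all assumptions made for illustration. A real pipeline would more likely use embeddings and a vector store.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def rank_chunks(query, chunks, top_k=3):
    """Return up to top_k chunks ranked by a simple TF-IDF-style score.

    Hypothetical helper: chunks with a score of zero are left out.
    """
    tokenized = [tokenize(chunk) for chunk in chunks]
    n = len(chunks)
    # Document frequency: number of chunks each term appears in.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    query_terms = set(tokenize(query))

    scored = []
    for idx, tokens in enumerate(tokenized):
        tf = Counter(tokens)
        score = sum(
            tf[term] * math.log((1 + n) / (1 + df[term]))
            for term in query_terms
            if term in tf
        )
        scored.append((score, idx))

    # Highest score first; ties keep the original chunk order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [chunks[idx] for score, idx in scored[:top_k] if score > 0]


def build_prompt(query, chunks, top_k=3):
    """Assemble an LLM prompt from the best-matching chunks (assumed wording)."""
    context = "\n---\n".join(rank_chunks(query, chunks, top_k))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"
```

The string returned by `build_prompt` is what would be sent to whichever LLM is in use; the generation step itself is left out because it depends on the model and client library.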
The most computationally expensive task is cleaning the data.
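To give a sense of what cleaning can involve, here is a small sketch that tidies text already extracted from one PDF page. It is an assumption about the kind of cleanup needed, not the actual cleaning code: `clean_page_text` is a hypothetical name, and the rules (rejoining hyphenated line breaks, dropping page-number lines, collapsing whitespace) are common heuristics rather than anything specific to this data.

```python
import re


def clean_page_text(raw):
    """Tidy text extracted from a PDF page (hypothetical helper)."""
    # Normalize line endings and treat form feeds as line breaks.
    text = raw.replace("\r\n", "\n").replace("\x0c", "\n")
    # Rejoin words that were hyphenated across a line break.
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)

    lines = []
    for line in text.split("\n"):
        # Collapse runs of spaces and tabs, then trim the line.
        line = re.sub(r"[ \t]+", " ", line).strip()
        # Skip lines that hold only a page number, e.g. "12" or "Page 12".
        if re.fullmatch(r"(?:page\s+)?\d+", line, flags=re.IGNORECASE):
            continue
        lines.append(line)

    # Merge consecutive non-empty lines into paragraphs;
    # blank lines mark the paragraph breaks.
    paragraphs = []
    current = []
    for line in lines:
        if line:
            current.append(line)
        elif current:
            paragraphs.append(" ".join(current))
            current = []
    if current:
        paragraphs.append(" ".join(current))
    return "\n\n".join(paragraphs)
```

These heuristics are lossy: the hyphen rule also merges genuine compounds that happen to break across lines, and the page-number rule drops any line that is only a number, so table rows can disappear. Running something like this over every page of a multi-gigabyte collection adds up, which fits with cleaning being the expensive step.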
This is the future of data-driven journalism. Have you ever wanted an LLM tuned on a pile of ATIP (Access to Information and Privacy) data from the Canadian government, so you can ask it questions and get the most unhinged responses back?