Automated PDF and text processing; information extraction from text based on grammatical structure
pdfplumber extraction techniques; general data cleaning and boxplots of word counts and densities; centroid words with TF-IDF and extractive summarisation by ranking; topic modelling and clustering; grammatical trends via dependencies and parts of speech
Data preprocessing and word clouds over time periods; statistical analysis: keyword extraction with TF-IDF; comparison against RAKE, Gensim, and spaCy; topic modelling with Latent Dirichlet Allocation; Named Entity Recognition; noun extraction with Matcher and frequency/momentum analysis; noun pairing and network graphs
Exploratory Data Analysis: frequency-based histograms and subplots; summarisation with TF-IDF centroid vectors; text statistics with PCA and K-means clustering; word2vec; graph centrality; formation of n-grams / phrases
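
As an illustration of the TF-IDF centroid summarisation mentioned above, here is a minimal sketch using only the Python standard library. It is not the implementation used in these projects: the function names (tokenize, tfidf_vectors, cosine, summarise), the smoothed IDF formula, and the choice to treat each sentence as a document are all assumptions made for the example. The idea is to build a TF-IDF vector per sentence, average them into a centroid, and keep the sentences most similar to that centroid.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a string and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(sentences):
    """Return one sparse TF-IDF vector (dict of term -> weight) per sentence."""
    token_lists = [tokenize(s) for s in sentences]
    n = len(token_lists)
    # Document frequency: number of sentences that contain each term.
    df = Counter(term for tokens in token_lists for term in set(tokens))
    vectors = []
    for tokens in token_lists:
        counts = Counter(tokens)
        total = len(tokens) or 1
        # Smoothed IDF (an assumption; other variants are equally common).
        vectors.append({
            term: (count / total) * (math.log((1 + n) / (1 + df[term])) + 1)
            for term, count in counts.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors; 0.0 if either is empty."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def summarise(sentences, k=2):
    """Pick the k sentences closest to the TF-IDF centroid, in original order."""
    vectors = tfidf_vectors(sentences)
    # Centroid: the mean TF-IDF weight of each term across all sentences.
    centroid = Counter()
    for vector in vectors:
        for term, weight in vector.items():
            centroid[term] += weight / len(vectors)
    scores = [cosine(vector, centroid) for vector in vectors]
    # Rank sentence indices by similarity to the centroid, highest first.
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:k]
    return [sentences[i] for i in sorted(top)]
```

Calling summarise(list_of_sentences, k=3) returns three of the input sentences in their original order. In practice a library vectoriser and a proper sentence splitter would likely replace the hand-rolled tokenize and tfidf_vectors helpers.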