Understanding POS tags and building a POS tagger from scratch
This repository provides a basic understanding of POS tags and walks through building a custom POS tagger using the Penn Treebank dataset.
Section 1. Introduction to Part of Speech tags
1.1 What is a Part of Speech?
1.2 What is Part of Speech tagging?
1.3 What is a Part of Speech tagger?
1.4 What are the various types of Part of Speech tags?
1.5 Which applications use POS tagging?
Section 2. Generate Part of Speech tags using various Python libraries
2.1 Generating POS tags using Polyglot library
2.2 Generating POS tags using Stanford CoreNLP
2.3 Generating POS tags using Spacy library
2.4 Why do we need to develop our own POS tagger?
Section 3. Build our own statistical POS tagger from scratch
3.1 Import dependencies
3.2 Explore dataset
3.2.1 Explore Brown Corpus
3.2.2 Explore Penn-Treebank Corpus
3.3 Generate features
3.4 Transform Dataset
3.5 Build training and testing dataset
3.6 Train model
3.7 Measure Accuracy
3.8 Generate POS tags for given sentence
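The Section 3 pipeline (generate features, transform the dataset, train, tag) can be sketched with a small word-level feature extractor like the one below. This is a minimal illustration; the feature names are placeholders, not the exact ones used in the notebook.

```python
def word_features(sentence, i):
    """Extract simple word-level features for POS tagging.

    `sentence` is a list of tokens; `i` is the index of the target word.
    The specific features below are illustrative choices, not the
    notebook's exact feature set.
    """
    word = sentence[i]
    return {
        "word": word.lower(),
        "is_capitalized": word[0].isupper(),
        "is_numeric": word.isdigit(),
        "suffix_2": word[-2:],  # short suffixes often signal the tag ("-ed", "-ly")
        "suffix_3": word[-3:],
        "prev_word": sentence[i - 1].lower() if i > 0 else "<START>",
        "next_word": sentence[i + 1].lower() if i < len(sentence) - 1 else "<END>",
    }

print(word_features(["The", "cat", "sat"], 1))
```

Each word becomes a dictionary of features; Section 3.4 then transforms such dictionaries into numeric vectors for the classifier.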
- Python 3.3+
- Polyglot
- Spacy
- Py-CoreNLP (uses Stanford CoreNLP)
- NLTK
- Scikit-learn
- Jupyter Notebook
- Set up the Python package manager pip
- See: How to install Python libraries using conda
- See: How to install Python libraries using pip
- No dependencies are required for this section.
2.1. Polyglot
2.2. Stanford CoreNLP and Py-CoreNLP
2.3. Spacy POS tagger
Installation on Windows
2.1. Polyglot
For installation, refer to this link
Step 1: $ git clone https://github.com/aboSamoor/polyglot.git
Step 2: $ python setup.py install
Step 3: Download the prebuilt wheels and install them with pip
$ pip install pycld2-0.31-cp36-cp36m-win_amd64.whl
$ pip install PyICU-1.9.8-cp36-cp36m-win_amd64.whl
2.2. Stanford POS tagger
Step 1: Install JDK 1.8 using this link
Step 2: Download Stanford CoreNLP from this link
Step 2.1: Download and extract the Stanford CoreNLP
Step 3: Start service of Stanford CoreNLP
Step 3.1: cd into the directory where you extracted Stanford CoreNLP
Step 3.2: Run the server using all jars in the current directory (i.e., the CoreNLP home directory)
$ java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000 -timeout 15000
Step 4: Set up Py-CoreNLP
$ pip install pycorenlp
2.3. Spacy POS tagger
Step 1: The documentation for Spacy is here
Step 2: Run the installation commands
$ sudo pip install spacy
$ sudo python3 -m spacy download en
Installation on Ubuntu
2.1. Polyglot
Step 1: sudo apt-get update
Step 2: sudo apt-get install python-pyicu
Step 3: sudo pip install pycld2
Step 4: sudo pip install Morfessor
Step 5: sudo apt-get install python-numpy libicu-dev
Step 6: sudo pip install PyICU
Step 7: sudo pip install polyglot
2.2. Stanford POS tagger
Step 1: Install JDK 1.8
Step 1.1: $ sudo mkdir /usr/lib/jvm
Step 1.2: $ sudo tar xzvf jdk1.8.0_172.tar.gz -C /usr/lib/jvm
Step 1.3: Set environment variable for java in .bashrc file
$ sudo vi ~/.bashrc or sudo gedit ~/.bashrc
Step 1.4: Set path at the end of the bashrc file
JAVA_HOME=/usr/lib/jvm/jdk1.8.0_172
PATH=$PATH:$HOME/bin:$JAVA_HOME/bin
JRE_HOME=$JAVA_HOME/jre
export JAVA_HOME
export JRE_HOME
export PATH
Step 2: Download Stanford CoreNLP from this link
Step 2.1: Download and extract the Stanford CoreNLP
Step 3: Start service of Stanford CoreNLP
Step 3.1: cd into the directory where you extracted Stanford CoreNLP
Step 3.2: Run the server using all jars in the current directory (i.e., the CoreNLP home directory)
$ java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000 -timeout 15000
Step 4: Set up Py-CoreNLP
$ sudo pip install pycorenlp
Step 5: Run the CoreNLP Docker image (instead of Steps 1 to 4 above)
$ docker run -p 9000:9000 --rm -it motiz88/corenlp
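Once the server is running, POS tags can be requested over HTTP. Below is a minimal standard-library sketch of that call, assuming the server is on localhost:9000 as started above; the function names are illustrative. Py-CoreNLP wraps this same endpoint.

```python
import json
from urllib.parse import quote
from urllib.request import Request, urlopen

def build_annotate_url(host="http://localhost:9000"):
    # The CoreNLP server accepts annotator settings as a JSON-encoded
    # "properties" URL parameter.
    props = json.dumps({"annotators": "pos", "outputFormat": "json"})
    return "{}/?properties={}".format(host, quote(props))

def corenlp_pos_tags(text, host="http://localhost:9000"):
    # POST the raw text; the server responds with JSON containing
    # one token list per sentence.
    request = Request(build_annotate_url(host), data=text.encode("utf-8"))
    with urlopen(request) as response:
        doc = json.loads(response.read().decode("utf-8"))
    return [(tok["word"], tok["pos"])
            for sent in doc["sentences"]
            for tok in sent["tokens"]]
```

With Py-CoreNLP the equivalent is roughly `StanfordCoreNLP('http://localhost:9000').annotate(text, properties={'annotators': 'pos', 'outputFormat': 'json'})`.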
2.3. Spacy POS tagger
Step 1: The documentation for Spacy is here
Step 2: Run the installation commands
$ sudo pip install spacy
$ sudo python3 -m spacy download en
Two dependencies are required for this section.
3.1. NLTK
3.2. Scikit-learn
3.1 NLTK
Step 1: $ sudo pip install numpy scipy nltk
Step 2: Download the NLTK data
$ python3
Step 3: Inside the Python shell
>>> import nltk
>>> nltk.download()
3.2 Scikit-learn
$ sudo pip install scikit-learn
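With scikit-learn installed, the transform/train/tag steps of Section 3 can be sketched on a toy dataset: `DictVectorizer` turns per-word feature dictionaries into numeric vectors, and a classifier (a decision tree here) learns the tag assignments. The features and sentences below are made-up placeholders, not the notebook's real training data.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.tree import DecisionTreeClassifier

# Toy training data: per-word feature dicts and their POS tags.
# Real training would use features extracted from Brown / Penn Treebank.
X_dicts = [
    {"word": "the", "suffix_2": "he"},
    {"word": "dog", "suffix_2": "og"},
    {"word": "runs", "suffix_2": "ns"},
    {"word": "a", "suffix_2": "a"},
    {"word": "cat", "suffix_2": "at"},
    {"word": "sleeps", "suffix_2": "ps"},
]
y = ["DT", "NN", "VBZ", "DT", "NN", "VBZ"]

vectorizer = DictVectorizer(sparse=False)  # dicts -> one-hot feature matrix
X = vectorizer.fit_transform(X_dicts)

clf = DecisionTreeClassifier()
clf.fit(X, y)

# Tag a word the model has seen before.
pred = clf.predict(vectorizer.transform([{"word": "dog", "suffix_2": "og"}]))
print(pred[0])  # -> NN
```

The same vectorize-then-classify pattern scales to the full corpus once the features of Section 3.3 are plugged in.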
- For installation you can refer to this link
- Jupyter Notebook comes built in with Anaconda.
- You can install Jupyter Notebook using the following command
$ sudo pip install jupyter notebook
- To start Jupyter Notebook, execute the following command in cmd/terminal
$ jupyter notebook
- For Session 1: use the Introduction_to_POS ipython notebook
- For Session 2: use the POS_tagger_Demo ipython notebook
- For Session 3: use the POS_from_scratch ipython notebook
See the GitPitch presentation using this link
Thanks to DataGiri/GreyAtom for hosting this event.