Understanding POS tags and building a POS tagger from scratch

jalajthanaki/POS-tag-workshop

All about Part of Speech (POS) tags

This repository gives you a basic understanding of POS tags. I have also tried to build a custom POS tagger using the Treebank dataset.


Workshop Outline

There are three main sections.

Section 1. Introduction to Part of Speech tags

           1.1 What are Parts of Speech?

           1.2 What is Part of Speech tagging?

           1.3 What is a Part of Speech tagger?

           1.4 What are the various types of Part of Speech tags?

           1.5 Which applications use POS tagging?

Section 2. Generate Part of Speech tags using various python libraries
   
           2.1 Generating POS tags using Polyglot library
   
           2.2 Generating POS tags using Stanford CoreNLP 
   
           2.3 Generating POS tags using Spacy library
           
           2.4 Why do we need to develop our own POS tagger?

Section 3. Build our own statistical POS tagger from scratch
   
           3.1 Import dependencies

           3.2 Explore dataset

                3.2.1 Explore Brown Corpus

                3.2.2 Explore Penn-Treebank Corpus

           3.3 Generate features

           3.4 Transform Dataset

           3.5 Build training and testing dataset

           3.6 Train model

           3.7 Measure Accuracy

           3.8 Generate POS tags for given sentence
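
Steps 3.3 to 3.5 of the outline can be sketched in plain Python as follows. This is an illustration only, not the code from the POS_from_scratch notebook: the function names, the chosen features, and the toy sentences are all assumptions. The sentences use the (word, tag) shape of NLTK's tagged corpora.

```python
def word_features(sentence, index):
    """Return a feature dict for the word at `index` in a list of tokens."""
    word = sentence[index]
    return {
        "word": word,
        "lower": word.lower(),
        "is_first": index == 0,
        "is_last": index == len(sentence) - 1,
        "is_capitalized": word[:1].isupper(),
        "is_numeric": word.isdigit(),
        "prefix_2": word[:2],
        "suffix_2": word[-2:],
        "suffix_3": word[-3:],
        "prev_word": "" if index == 0 else sentence[index - 1],
        "next_word": "" if index == len(sentence) - 1 else sentence[index + 1],
    }


def transform_to_dataset(tagged_sentences):
    """Flatten [(word, tag), ...] sentences into parallel lists X (features) and y (tags)."""
    X, y = [], []
    for tagged in tagged_sentences:
        words = [word for word, _ in tagged]
        for index, (_, tag) in enumerate(tagged):
            X.append(word_features(words, index))
            y.append(tag)
    return X, y


def split_dataset(tagged_sentences, train_fraction=0.75):
    """Split whole sentences (not individual words) into training and testing parts."""
    cutoff = int(train_fraction * len(tagged_sentences))
    return tagged_sentences[:cutoff], tagged_sentences[cutoff:]


# Toy data (an assumption for illustration); tags are Penn Treebank style.
sentences = [
    [("The", "DT"), ("dog", "NN"), ("barks", "VBZ")],
    [("She", "PRP"), ("runs", "VBZ"), ("fast", "RB")],
    [("Cats", "NNS"), ("sleep", "VBP")],
    [("He", "PRP"), ("ate", "VBD"), ("2", "CD"), ("apples", "NNS")],
]
train_sents, test_sents = split_dataset(sentences)
X_train, y_train = transform_to_dataset(train_sents)
X_test, y_test = transform_to_dataset(test_sents)
print(len(X_train), len(X_test))
print(X_train[0]["suffix_2"], y_train[0])
```

Feature dicts of this kind are the input format that scikit-learn's DictVectorizer accepts, so X_train and y_train could be passed on to a vectorizer and a classifier for step 3.6.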

Dependencies

  • Python 3.3+

  • Polyglot

  • Spacy

  • Py-CoreNLP (uses Stanford CoreNLP)

  • NLTK

  • Scikit-learn

  • jupyter notebook

Installation Instructions

General instructions

For section 1:

  • No dependency required for this section.

For section 2:

Three dependencies are required.

2.1. Polyglot

2.2. Stanford CoreNLP and Py-CoreNLP

2.3. Spacy POS tagger


Windows OS

2.1. Polyglot

For installation, refer to this link

Step 1: $ git clone https://github.com/aboSamoor/polyglot.git

Step 2: $ cd polyglot && python setup.py install

Step 3: Download the pycld2 and PyICU wheels, then install them with pip
        
        $ pip install pycld2-0.31-cp36-cp36m-win_amd64.whl
        
        $ pip install PyICU-1.9.8-cp36-cp36m-win_amd64.whl

2.2. Stanford POS tagger

Step 1: Install JDK 1.8 using this link

Step 2: Download Stanford CoreNLP from this link

    Step 2.1: Download and extract the Stanford CoreNLP

Step 3: Start service of Stanford CoreNLP

    Step 3.1: cd ~/Path where you extract the Stanford CoreNLP
    
    Step 3.2: Run the server using all jars in the current directory (e.g., the CoreNLP home directory)
            
            $ java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000 -timeout 15000

Step 4: Set up Py-CoreNLP

        $ pip install pycorenlp

2.3. Spacy POS tagger

Step 1: Documentation for Spacy is here

Step 2: Run the installation commands

       $ pip install spacy
       
       $ python -m spacy download en

Linux OS

2.1. Polyglot

Step 1: sudo apt-get update
Step 2: sudo apt-get install python-pyicu
Step 3: sudo pip install pycld2  
Step 4: sudo pip install Morfessor
Step 5: sudo apt-get install python-numpy libicu-dev
Step 6: sudo pip install PyICU
Step 7: sudo pip install polyglot

2.2. Stanford POS tagger

Step 1: Install JDK 1.8

    Step 1.1: $ sudo mkdir /usr/lib/jvm
    
    Step 1.2: $ sudo tar xzvf jdk1.8.0_172.tar.gz -C /usr/lib/jvm
    
    Step 1.3: Open the .bashrc file to set the Java environment variables
    
              $ vi ~/.bashrc or gedit ~/.bashrc
    
    Step 1.4: Set the paths at the end of the .bashrc file
    
              JAVA_HOME=/usr/lib/jvm/jdk1.8.0_172
              PATH=$PATH:$HOME/bin:$JAVA_HOME/bin
              JRE_HOME=$JAVA_HOME/jre
              export JAVA_HOME
              export JRE_HOME
              export PATH

Step 2: Download Stanford CoreNLP from this link

        Step 2.1: Download and extract the Stanford CoreNLP

Step 3: Start service of Stanford CoreNLP

        Step 3.1: cd ~/Path where you extract the Stanford CoreNLP
    
        Step 3.2: Run the server using all jars in the current directory (e.g., the CoreNLP home directory)
            
                $ java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000 -timeout 15000

Step 4: Setup Py-coreNLP

        $ sudo pip install pycorenlp

Step 5: Use the Docker image (an alternative to Steps 1 to 4 above)

        docker run -p 9000:9000 --rm -it motiz88/corenlp       
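
With the server from Step 3 (or the Docker container) listening on port 9000, POS tags can be requested from Python through pycorenlp. The following is a minimal sketch, not code from the workshop notebooks. The helper names extract_pos_tags and tag_with_corenlp are hypothetical, and the sample response is hand-written in the JSON shape the server returns for the tokenize, ssplit and pos annotators.

```python
def extract_pos_tags(annotation):
    """Collect (word, tag) pairs from a CoreNLP JSON annotation already parsed into a dict."""
    return [
        (token["word"], token["pos"])
        for sentence in annotation["sentences"]
        for token in sentence["tokens"]
    ]


def tag_with_corenlp(text, url="http://localhost:9000"):
    """Send `text` to a running CoreNLP server and return (word, tag) pairs."""
    # Imported here so that extract_pos_tags can be tried without pycorenlp installed.
    from pycorenlp import StanfordCoreNLP

    client = StanfordCoreNLP(url)
    annotation = client.annotate(
        text,
        properties={"annotators": "tokenize,ssplit,pos", "outputFormat": "json"},
    )
    return extract_pos_tags(annotation)


# Hand-written stand-in for a server response, trimmed to the two fields used above.
sample = {
    "sentences": [
        {"tokens": [{"word": "Dogs", "pos": "NNS"}, {"word": "bark", "pos": "VBP"}]}
    ]
}
print(extract_pos_tags(sample))
```

Calling tag_with_corenlp("Dogs bark.") needs the server to be running. If the request exceeds the -timeout value given at startup, the server may return an error string instead of JSON, so raising that value is worth trying for long inputs.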

2.3. Spacy POS tagger

Step 1: Documentation for Spacy is here

Step 2: Run the installation commands

           $ sudo pip install spacy
           
           $ sudo python3 -m spacy download en
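
Once spaCy and an English model are installed, tagging a sentence might look like the sketch below. The helper names are hypothetical. The default model name en_core_web_sm is also an assumption: newer spaCy releases use full package names instead of the en shortcut, so pass model="en" if the shortcut installed by the command above works in your version. The stand-in tokens only mimic the text, pos_ and tag_ attributes of spaCy's Token so that the helper can be tried without a model.

```python
from types import SimpleNamespace


def tokens_to_tags(doc):
    """Return (text, coarse tag, fine-grained tag) for each token of a spaCy Doc."""
    return [(token.text, token.pos_, token.tag_) for token in doc]


def tag_with_spacy(text, model="en_core_web_sm"):
    """Load a spaCy pipeline and tag `text`. The default model name is an assumption."""
    # Imported lazily so that tokens_to_tags can be tried without spaCy installed.
    import spacy

    nlp = spacy.load(model)
    return tokens_to_tags(nlp(text))


# Stand-in tokens with the same attribute names as spaCy's Token, for a dry run.
fake_doc = [
    SimpleNamespace(text="Dogs", pos_="NOUN", tag_="NNS"),
    SimpleNamespace(text="bark", pos_="VERB", tag_="VBP"),
]
print(tokens_to_tags(fake_doc))
```

Here pos_ holds the coarse universal tag and tag_ the fine-grained tag, which for the English models follows the Penn Treebank tag set.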

Mac OS

2.1. Polyglot

Step 1: brew install icu4c (assumes Homebrew is installed)
Step 2: sudo pip install numpy
Step 3: sudo pip install pycld2  
Step 4: sudo pip install Morfessor
Step 5: sudo pip install PyICU (the build may need the ICU location from Step 1 on PATH or PKG_CONFIG_PATH)
Step 6: sudo pip install polyglot

2.2. Stanford POS tagger

Set up Stanford CoreNLP

Step 1: Install JDK 1.8 using these steps

Step 2: Download Stanford CoreNLP from this link

    Step 2.1: Download and extract the Stanford CoreNLP

Step 3: Start service of Stanford CoreNLP

    Step 3.1: cd ~/Path where you extract the Stanford CoreNLP
    
    Step 3.2: Run the server using all jars in the current directory (e.g., the CoreNLP home directory)
            
            $ java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000 -timeout 15000

Step 4: Setup Py-coreNLP

    Step 4.1: $ sudo pip install pycorenlp

2.3. Spacy POS tagger

Step 1: Documentation for Spacy is here

Step 2: Run the installation commands

           $ sudo pip install spacy
           
           $ sudo python3 -m spacy download en

For section 3:

Two dependencies are required.


3.1. NLTK

3.2. Scikit-learn


Windows OS

3.1 NLTK

  Step 1: $ pip install numpy scipy nltk
  
  Step 2: Download NLTK data
          
          $ python
          
  Step 3: Inside python shell
          
          >>> import nltk
          
          >>> nltk.download()

3.2 Scikit-learn

  $ pip install scikit-learn

Linux OS

3.1 NLTK

  Step 1: $ sudo pip install numpy scipy nltk
  
  Step 2: Download NLTK data
          
          $ python3
          
  Step 3: Inside python shell
          
          >>> import nltk
          
          >>> nltk.download()

3.2 Scikit-learn

  $ sudo pip install scikit-learn

Mac OS

3.1 NLTK

  Step 1: $ sudo pip install numpy scipy nltk
  
  Step 2: Download NLTK data
          
          $ python3
          
  Step 3: Inside python shell
          
          >>> import nltk
          
          >>> nltk.download()

3.2 Scikit-learn

  $ sudo pip install scikit-learn
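
After nltk.download() has fetched the treebank data, the tagged sentences can be loaded and used to check the setup end to end. The sketch below deliberately swaps in a most-frequent-tag baseline rather than the scikit-learn classifier that the workshop trains, so that the accuracy measurement is easy to follow. All function names and the toy sentences are assumptions for illustration.

```python
from collections import Counter, defaultdict


def load_treebank_sample():
    """Return NLTK's Penn Treebank sample as lists of (word, tag) pairs."""
    # Needs the 'treebank' corpus fetched through nltk.download().
    from nltk.corpus import treebank

    return [list(sentence) for sentence in treebank.tagged_sents()]


def train_baseline(tagged_sentences):
    """Remember the most frequent tag per word, plus the overall most frequent tag."""
    per_word = defaultdict(Counter)
    overall = Counter()
    for sentence in tagged_sentences:
        for word, tag in sentence:
            per_word[word.lower()][tag] += 1
            overall[tag] += 1
    lookup = {word: counts.most_common(1)[0][0] for word, counts in per_word.items()}
    default_tag = overall.most_common(1)[0][0]
    return lookup, default_tag


def tag_sentence(words, model):
    """Tag each word with its remembered tag, falling back to the default for unseen words."""
    lookup, default_tag = model
    return [(word, lookup.get(word.lower(), default_tag)) for word in words]


def accuracy(tagged_sentences, model):
    """Fraction of words whose predicted tag matches the gold tag."""
    correct = total = 0
    for sentence in tagged_sentences:
        words = [word for word, _ in sentence]
        predicted = tag_sentence(words, model)
        for (_, gold), (_, guess) in zip(sentence, predicted):
            correct += gold == guess
            total += 1
    return correct / total if total else 0.0


# Toy data so the sketch runs without the corpus; swap in load_treebank_sample() for real use.
train = [
    [("the", "DT"), ("dog", "NN"), ("barks", "VBZ")],
    [("the", "DT"), ("cat", "NN"), ("sleeps", "VBZ")],
    [("dog", "NN"), ("food", "NN")],
]
held_out = [[("the", "DT"), ("dog", "NN"), ("sleeps", "VBZ"), ("soundly", "RB")]]
model = train_baseline(train)
print(tag_sentence(["the", "bird", "barks"], model))
print(accuracy(held_out, model))
```

A baseline like this gives a reference number to compare against the accuracy of the trained classifier in step 3.7.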

Install jupyter notebook

  • For installation, you can refer to this link

  • Anaconda ships with Jupyter Notebook built in.

  • You can also install Jupyter Notebook using the following command

    $ sudo pip install jupyter notebook

  • To start Jupyter Notebook, run the following command in cmd/terminal

    $ jupyter notebook

Usage

  • For Section 1: Use the Introduction_to_POS IPython notebook

  • For Section 2: Use the POS_tagger_Demo IPython notebook

  • For Section 3: Use the POS_from_scratch IPython notebook

Share this Git-Pitch Presentation

See the Git-Pitch presentation using this link

Special Thanks

Thanks to DataGiri/GreyAtom for hosting this event.
