Natural Language Processing with NLTK – Part 1

Preparation

In [1]:
import nltk

Download the NLTK data packages (run once; commented out here)

In [2]:
#nltk.download()

Tokenisation

There are two types:

  1. Word tokenisers – separate by words
  2. Sentence tokenisers – separate by sentences

Terminology

  • Corpora – a body of text (e.g. medical journals, presidential speeches)
  • Lexicon – words and their meanings (e.g. an investor-speak dictionary vs a regular English dictionary – a word like ‘bull’ means something different in each)
In [3]:
from nltk.tokenize import sent_tokenize, word_tokenize
In [4]:
example = 'Hello Mr. Smith, how are you doing today? The weather is great and Python is awesome. The sky is pinkish-blue.'

Sentence tokenisers

In [5]:
sent_tokenize(example)
Out[5]:
['Hello Mr. Smith, how are you doing today?',
 'The weather is great and Python is awesome.',
 'The sky is pinkish-blue.']

Word tokenisers – punctuation is treated as a separate token

In [6]:
word_tokenize(example)
Out[6]:
['Hello',
 'Mr.',
 'Smith',
 ',',
 'how',
 'are',
 'you',
 'doing',
 'today',
 '?',
 'The',
 'weather',
 'is',
 'great',
 'and',
 'Python',
 'is',
 'awesome',
 '.',
 'The',
 'sky',
 'is',
 'pinkish-blue',
 '.']

Stop Words

Usually the common filler words like ‘a’, ‘the’, etc. – basically words that don’t add meaning to a text.

In [7]:
from nltk.corpus import stopwords
In [8]:
example_2 = 'This is an example showing off stop word filtration.'
In [9]:
stop_words = set(stopwords.words('english'))
In [10]:
words = word_tokenize(example_2)
In [11]:
filtered_sent = []

for w in words:
    if w not in stop_words:
        filtered_sent.append(w)
In [12]:
filtered_sent
Out[12]:
['This', 'example', 'showing', 'stop', 'word', 'filtration', '.']

You can do the above (removing stop words) in a single list comprehension

In [13]:
filtered_sent2 = [w for w in words if w not in stop_words]
In [14]:
filtered_sent2
Out[14]:
['This', 'example', 'showing', 'stop', 'word', 'filtration', '.']

Stemming

A text preprocessing step where you keep only the root (stem) of a word. For example, ‘run’, ‘ran’, ‘running’ etc. can all be tied to their root word ‘run’. Note that the stem is not always a real word.

In [15]:
from nltk.stem import PorterStemmer
In [16]:
ps = PorterStemmer()
In [17]:
example_words = ['python','pythoner','pythoning','pythoned','pythonly']
In [18]:
for w in example_words:
    print (ps.stem(w))
python
python
python
python
pythonli
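Porter isn’t the only stemmer in NLTK. As a quick sketch, the SnowballStemmer (a.k.a. ‘Porter2’, a refinement of the original algorithm) often gives slightly cleaner stems:

```python
from nltk.stem import SnowballStemmer

# Snowball ("Porter2") is a refinement of the original Porter algorithm
snow = SnowballStemmer('english')

for w in ['running', 'fairly', 'pythonly']:
    print(snow.stem(w))
```

Snowball handles suffixes like ‘-ly’ somewhat better than the original Porter stemmer, though the output is still not guaranteed to be a real word.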
In [19]:
new_text = "It is very important to be pythonly while you are pythoning with python. All pythoners have pythoned poorly at least once."
In [20]:
new_text_words = word_tokenize(new_text)
In [21]:
for w in new_text_words:
    print(ps.stem(w))
It
is
veri
import
to
be
pythonli
while
you
are
python
with
python
.
all
python
have
python
poorli
at
least
onc
.

Part of Speech Tagging & Chunking (grouping) & Chinking (removing)

In [22]:
from nltk.corpus import state_union
In [23]:
from nltk.tokenize import PunktSentenceTokenizer
In [24]:
sample = state_union.raw('2006-GWBush.txt')
In [25]:
train_text = state_union.raw('2005-GWBush.txt')
In [26]:
custom_sent_tokeniser = PunktSentenceTokenizer(train_text)
In [27]:
tokenized = custom_sent_tokeniser.tokenize(sample)
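Before defining the full function, here is a minimal sketch of how a chunk grammar works, using a hand-tagged sentence (tags supplied by hand for illustration) so no tagger model is needed:

```python
import nltk

# A hand-tagged sentence (tags supplied by hand for illustration)
tagged = [('quickly', 'RB'), ('ran', 'VBD'),
          ('Mr.', 'NNP'), ('Smith', 'NNP'), ('home', 'NN')]

# Grammar: optional adverbs, optional verbs, one or more proper nouns,
# then an optional common noun
grammar = r"Chunk: {<RB.?>*<VB.?>*<NNP>+<NN>?}"
chunked = nltk.RegexpParser(grammar).parse(tagged)
print(chunked)
```

The parser groups the matching tag sequence into a single `Chunk` subtree of the sentence tree.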
In [28]:
def process_content():
    try:
        for i in tokenized:
            words = nltk.word_tokenize(i)
            tagged = nltk.pos_tag(words)
            
            # Chunk optional adverbs (<RB.?>*), optional verbs (<VB.?>*),
            # one or more proper nouns (<NNP>+) and an optional noun (<NN>?)
            chunkGram = r"""Chunk: {<RB.?>*<VB.?>*<NNP>+<NN>?}"""
            
            chunkParser = nltk.RegexpParser(chunkGram)
            chunked = chunkParser.parse(tagged)
            
            #chunked.draw()
            print(chunked)
    except Exception as e:
        print(str(e))
In [29]:
process_content()
(S
  (Chunk PRESIDENT/NNP GEORGE/NNP W./NNP BUSH/NNP)
  'S/POS
  (Chunk ADDRESS/NNP)
  BEFORE/IN
  (Chunk A/NNP JOINT/NNP SESSION/NNP)
  OF/IN
  (Chunk THE/NNP CONGRESS/NNP ON/NNP THE/NNP STATE/NNP)
  OF/IN
  (Chunk THE/NNP UNION/NNP January/NNP)
  31/CD
  ,/,
  2006/CD
  (Chunk THE/NNP PRESIDENT/NNP)
  :/:
  (Chunk Thank/NNP)
  you/PRP
  all/DT
  ./.)
(S
  (Chunk Mr./NNP Speaker/NNP)
  ,/,
  (Chunk Vice/NNP President/NNP Cheney/NNP)
  ,/,
  members/NNS
  of/IN
  (Chunk Congress/NNP)
  ,/,
  members/NNS
  of/IN
  the/DT
  (Chunk Supreme/NNP Court/NNP)
  and/CC
  diplomatic/JJ
  corps/NN
  ,/,
  distinguished/JJ
  guests/NNS
  ,/,
  and/CC
  fellow/JJ
  citizens/NNS
  :/:
  Today/VB
  our/PRP$
  nation/NN
  lost/VBD
  a/DT
  beloved/VBN
  ,/,
  graceful/JJ
  ,/,
  courageous/JJ
  woman/NN
  who/WP
  (Chunk called/VBD America/NNP)
  to/TO
  its/PRP$
  founding/NN
  ideals/NNS
  and/CC
  carried/VBD
  on/IN
  a/DT
  noble/JJ
  dream/NN
  ./.)
(S
  Tonight/NN
  we/PRP
  are/VBP
  comforted/VBN
  by/IN
  the/DT
  hope/NN
  of/IN
  a/DT
  glad/JJ
  reunion/NN
  with/IN
  the/DT
  husband/NN
  who/WP
  was/VBD
  taken/VBN
  so/RB
  long/RB
  ago/RB
  ,/,
  and/CC
  we/PRP
  are/VBP
  grateful/JJ
  for/IN
  the/DT
  good/JJ
  life/NN
  of/IN
  (Chunk Coretta/NNP Scott/NNP King/NNP)
  ./.)
(S (/( (Chunk Applause/NNP) ./. )/))
In [30]:
def process_content_chink():
    try:
        for i in tokenized:
            words = nltk.word_tokenize(i)
            tagged = nltk.pos_tag(words)
            
            # {...} chunks everything, then }...{ "chinks" (removes)
            # verbs, prepositions and determiners from those chunks
            chunkGram = r"""Chunk: {<.*>+}
                                    }<VB.?|IN|DT>+{"""
            
            chunkParser = nltk.RegexpParser(chunkGram)
            chunked = chunkParser.parse(tagged)
            
            #chunked.draw()
            print(chunked)
    except Exception as e:
        print(str(e))
In [31]:
process_content_chink()
(S
  (Chunk PRESIDENT/NNP GEORGE/NNP W./NNP BUSH/NNP 'S/POS ADDRESS/NNP)
  BEFORE/IN
  (Chunk A/NNP JOINT/NNP SESSION/NNP)
  OF/IN
  (Chunk THE/NNP CONGRESS/NNP ON/NNP THE/NNP STATE/NNP)
  OF/IN
  (Chunk
    THE/NNP
    UNION/NNP
    January/NNP
    31/CD
    ,/,
    2006/CD
    THE/NNP
    PRESIDENT/NNP
    :/:
    Thank/NNP
    you/PRP)
  all/DT
  (Chunk ./.))
(S
  (Chunk
    Mr./NNP
    Speaker/NNP
    ,/,
    Vice/NNP
    President/NNP
    Cheney/NNP
    ,/,
    members/NNS)
  of/IN
  (Chunk Congress/NNP ,/, members/NNS)
  of/IN
  the/DT
  (Chunk
    Supreme/NNP
    Court/NNP
    and/CC
    diplomatic/JJ
    corps/NN
    ,/,
    distinguished/JJ
    guests/NNS
    ,/,
    and/CC
    fellow/JJ
    citizens/NNS
    :/:)
  Today/VB
  (Chunk our/PRP$ nation/NN)
  lost/VBD
  a/DT
  beloved/VBN
  (Chunk ,/, graceful/JJ ,/, courageous/JJ woman/NN who/WP)
  called/VBD
  (Chunk America/NNP to/TO its/PRP$ founding/NN ideals/NNS and/CC)
  carried/VBD
  on/IN
  a/DT
  (Chunk noble/JJ dream/NN ./.))
(S
  (Chunk Tonight/NN we/PRP)
  are/VBP
  comforted/VBN
  by/IN
  the/DT
  (Chunk hope/NN)
  of/IN
  a/DT
  (Chunk glad/JJ reunion/NN)
  with/IN
  the/DT
  (Chunk husband/NN who/WP)
  was/VBD
  taken/VBN
  (Chunk so/RB long/RB ago/RB ,/, and/CC we/PRP)
  are/VBP
  (Chunk grateful/JJ)
  for/IN
  the/DT
  (Chunk good/JJ life/NN)
  of/IN
  (Chunk Coretta/NNP Scott/NNP King/NNP ./.))

Named Entity Recognition (NER)

In [32]:
def ner():
    try:
        for i in tokenized:
            words = nltk.word_tokenize(i)
            tagged = nltk.pos_tag(words)
           
            # Labels named entities e.g. PERSON, ORGANIZATION, GPE
            namedEnt = nltk.ne_chunk(tagged)
            #namedEnt.draw()
            print(namedEnt)
    except Exception as e:
        print(str(e))
In [33]:
ner()
(S
  PRESIDENT/NNP
  (PERSON GEORGE/NNP W./NNP BUSH/NNP)
  'S/POS
  (ORGANIZATION ADDRESS/NNP)
  BEFORE/IN
  A/NNP
  (ORGANIZATION JOINT/NNP)
  SESSION/NNP
  OF/IN
  (ORGANIZATION THE/NNP)
  (ORGANIZATION CONGRESS/NNP)
  ON/NNP
  THE/NNP
  (ORGANIZATION STATE/NNP OF/IN)
  (ORGANIZATION THE/NNP)
  (ORGANIZATION UNION/NNP)
  January/NNP
  31/CD
  ,/,
  2006/CD
  (ORGANIZATION THE/NNP)
  PRESIDENT/NNP
  :/:
  Thank/NNP
  you/PRP
  all/DT
  ./.)
(S
  (PERSON Mr./NNP Speaker/NNP)
  ,/,
  Vice/NNP
  President/NNP
  (PERSON Cheney/NNP)
  ,/,
  members/NNS
  of/IN
  (ORGANIZATION Congress/NNP)
  ,/,
  members/NNS
  of/IN
  the/DT
  (ORGANIZATION Supreme/NNP Court/NNP)
  and/CC
  diplomatic/JJ
  corps/NN
  ,/,
  distinguished/JJ
  guests/NNS
  ,/,
  and/CC
  fellow/JJ
  citizens/NNS
  :/:
  Today/VB
  our/PRP$
  nation/NN
  lost/VBD
  a/DT
  beloved/VBN
  ,/,
  graceful/JJ
  ,/,
  courageous/JJ
  woman/NN
  who/WP
  called/VBD
  (GPE America/NNP)
  to/TO
  its/PRP$
  founding/NN
  ideals/NNS
  and/CC
  carried/VBD
  on/IN
  a/DT
  noble/JJ
  dream/NN
  ./.)

Lemmatisation

Same as stemming, but the end result is always a real word (unlike ‘pythonli’ previously). By default, lemmatize() assumes the pos is ‘n’ (noun), which is sometimes incorrect, as shown below with ‘better’.

In [34]:
from nltk.stem import WordNetLemmatizer
In [35]:
lemmatiser = WordNetLemmatizer()
In [39]:
lemmatiser.lemmatize('better')
Out[39]:
'better'
In [36]:
lemmatiser.lemmatize('better',pos='a')
Out[36]:
'good'
In [37]:
lemmatiser.lemmatize('best', pos='a')
Out[37]:
'best'
In [38]:
lemmatiser.lemmatize('run','v')
Out[38]:
'run'

Corpora

To find the corpora used in this tutorial, first locate the directory where the nltk package is installed; the corpora themselves live in the nltk_data folder.

In [40]:
nltk.__file__
Out[40]:
'/Users/ryanong/anaconda3/lib/python3.6/site-packages/nltk/__init__.py'

To access corpora

To access a particular corpus, in this case gutenberg, you have multiple text files within gutenberg to choose from (these can be seen in the nltk_data folder).

In [42]:
from nltk.corpus import gutenberg
In [43]:
sample = gutenberg.raw('bible-kjv.txt')
In [45]:
tok_sent = sent_tokenize(sample)
In [46]:
tok_sent[5:15]
Out[46]:
['1:5 And God called the light Day, and the darkness he called Night.',
 'And the evening and the morning were the first day.',
 '1:6 And God said, Let there be a firmament in the midst of the waters,\nand let it divide the waters from the waters.',
 '1:7 And God made the firmament, and divided the waters which were\nunder the firmament from the waters which were above the firmament:\nand it was so.',
 '1:8 And God called the firmament Heaven.',
 'And the evening and the\nmorning were the second day.',
 '1:9 And God said, Let the waters under the heaven be gathered together\nunto one place, and let the dry land appear: and it was so.',
 '1:10 And God called the dry land Earth; and the gathering together of\nthe waters called he Seas: and God saw that it was good.',
 '1:11 And God said, Let the earth bring forth grass, the herb yielding\nseed, and the fruit tree yielding fruit after his kind, whose seed is\nin itself, upon the earth: and it was so.',
 '1:12 And the earth brought forth grass, and herb yielding seed after\nhis kind, and the tree yielding fruit, whose seed was in itself, after\nhis kind: and God saw that it was good.']

WordNet

WordNet is a lexical database for the English language. It groups English words into sets of synonyms called synsets, provides short definitions and usage examples, and records a number of relations among these synonym sets or their members. WordNet can thus be seen as a combination of dictionary and thesaurus.

In [47]:
from nltk.corpus import wordnet

Synsets (sets of synonyms) for the word ‘program’

In [49]:
syns = wordnet.synsets('program')
In [50]:
syns
Out[50]:
[Synset('plan.n.01'),
 Synset('program.n.02'),
 Synset('broadcast.n.02'),
 Synset('platform.n.02'),
 Synset('program.n.05'),
 Synset('course_of_study.n.01'),
 Synset('program.n.07'),
 Synset('program.n.08'),
 Synset('program.v.01'),
 Synset('program.v.02')]

Lemmas

In [51]:
syns[0].lemmas()
Out[51]:
[Lemma('plan.n.01.plan'),
 Lemma('plan.n.01.program'),
 Lemma('plan.n.01.programme')]
In [52]:
syns[0].lemmas()[0].name()
Out[52]:
'plan'

Definition of the word plan

In [55]:
syns[0].definition()
Out[55]:
'a series of steps to be carried out or goals to be accomplished'

Examples of ‘plan’

In [56]:
syns[0].examples()
Out[56]:
['they drew up a six-step plan', 'they discussed plans for a new bond issue']

Synonyms and antonyms

In [59]:
synonyms = []
antonyms = []
In [60]:
for syn in wordnet.synsets('good'):
    for l in syn.lemmas():
        synonyms.append(l.name())
        if l.antonyms():
            antonyms.append(l.antonyms()[0].name())
In [64]:
print(set(synonyms))
{'proficient', 'dependable', 'undecomposed', 'soundly', 'sound', 'estimable', 'upright', 'trade_good', 'skillful', 'skilful', 'full', 'secure', 'near', 'right', 'practiced', 'in_force', 'ripe', 'effective', 'expert', 'unspoiled', 'salutary', 'dear', 'in_effect', 'unspoilt', 'well', 'honorable', 'respectable', 'just', 'safe', 'commodity', 'honest', 'good', 'serious', 'goodness', 'thoroughly', 'adept', 'beneficial'}
In [65]:
print(set(antonyms))
{'badness', 'evil', 'bad', 'ill', 'evilness'}

Semantic similarity

In [66]:
w1 = wordnet.synset('ship.n.01')
w2 = wordnet.synset('boat.n.01')
In [69]:
w1.wup_similarity(w2)
Out[69]:
0.9090909090909091
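wup_similarity is Wu–Palmer similarity, which scores two synsets by how deep their lowest common subsumer (LCS) sits in the WordNet taxonomy: wup = 2 · depth(LCS) / (depth(s1) + depth(s2)). A minimal pure-Python sketch, with assumed (hypothetical) depths chosen to reproduce the result above:

```python
# Wu-Palmer similarity computed by hand.
# The depths below are assumed for illustration, not read from WordNet.
depths = {'vessel': 10, 'ship': 11, 'boat': 11}
lcs = 'vessel'  # assume 'vessel' is the lowest common subsumer of ship and boat

wup = 2 * depths[lcs] / (depths['ship'] + depths['boat'])
print(wup)  # → 0.9090909090909091
```

The deeper the shared ancestor relative to the two synsets, the closer the score gets to 1.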
