Stanford NLP: Regular Expressions, Tokenisation, Normalisation, Stemming and Sentence Segmentation

Regular Expressions

A formal language for specifying text strings. A practical way to learn regular expressions is to experiment with them on regexpal.com.

Disjunctions

  • Letters inside square brackets [] : e.g. [wW]oodchuck matches Woodchuck or woodchuck
  • Ranges : e.g. [0-9], [A-Z], [a-z]

Negation in Disjunctions (^ means negation only when first in [])

  • [^A-Z] : not an upper case letter
  • [^e^] : not an ‘e’ and not a caret
  • | (pipe symbol) : ‘or’, e.g. a|b|c is equivalent to [abc]

Special characters

  • ? : previous character is optional, e.g. colou?r matches color and colour (the ‘u’ is optional)
  • * : 0 or more of the previous character, e.g. oo*h! matches oh!, ooh!, oooh!, …
  • + : 1 or more of the previous character, e.g. o+h! matches oh!, ooh!, oooh!, …
  • . : matches any single character, e.g. beg.n matches begin, begun, beg3n

Anchors

  • ^[A-Z] (the caret outside the brackets) : matches an upper case letter at the beginning of the line, e.g. ‘Palo Alto’
  • [A-Z]$ : matches an upper case letter at the end of the line
  • \.$ : matches an actual period (full stop) at the end of the line
  • .$ : matches any character at the end of the line (these patterns are demonstrated in the sketch below)
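
A minimal Python sketch of the patterns above; the example strings and the choice of the standard re module are just for illustration:

```python
import re

# Disjunction inside brackets: matches "Woodchuck" or "woodchuck"
print(re.findall(r"[wW]oodchuck", "woodchuck Woodchuck woodchucks"))

# Optional character: colou?r matches both spellings
print(re.findall(r"colou?r", "color colour"))

# Kleene star and plus: 0-or-more vs 1-or-more of the previous character
print(re.findall(r"oo*h!", "oh! ooh! oooh!"))   # ['oh!', 'ooh!', 'oooh!']
print(re.findall(r"o+h!", "oh! ooh! oooh!"))    # ['oh!', 'ooh!', 'oooh!']

# Anchors: ^ at the start of the line, $ at the end
print(re.findall(r"^[A-Z]", "Palo Alto"))       # ['P']
print(re.findall(r"\.$", "The end."))           # ['.']
```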

Matching strings that we should not have matched are false positives (Type I errors); failing to match strings that we should have matched are false negatives (Type II errors). Reducing the error rate for an application often involves two antagonistic efforts: increasing accuracy or precision (minimising false positives) and increasing coverage or recall (minimising false negatives).

Tokenisation

Every NLP task needs to do text normalisation, which starts with:

  1. Segmenting/tokenising words in running text
  2. Normalising word formats
  3. Segmenting sentences in running text

How many words are in a sentence?

“I do uh main- mainly business data processing”

‘main-’ is a fragment and ‘uh’ is a filled pause; whether to count them as words depends on the application. A lemma is a set of wordforms that share the same stem, the same part of speech and roughly the same word sense, e.g. ‘cat’ and ‘cats’ belong to the same lemma. A wordform is the full inflected surface form.

“they lay back on the San Francisco grass and looked at the stars and their”

Type: an element of the vocabulary (the vocabulary is the set of types; its size is the number of distinct types). Token: an instance of a type in running text (N is the number of tokens). The sentence above has 15 tokens, or 14 if ‘San Francisco’ is treated as one token, and 13 types (‘the’ and ‘and’ each appear twice), 12 if ‘San Francisco’ is one type, or 11 if ‘they’ and ‘their’ are counted together.
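
The distinction can be made concrete with a short sketch; the whitespace split below is a deliberately naive tokeniser, so ‘San’ and ‘Francisco’ count as separate tokens:

```python
text = "they lay back on the San Francisco grass and looked at the stars and their"

tokens = text.split()   # naive whitespace tokenisation
types = set(tokens)     # the vocabulary: distinct wordforms

print(len(tokens))      # 15 tokens (N)
print(len(types))       # 13 types ('the' and 'and' each appear twice)
```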

Issues

When removing punctuation during text normalisation, we might create spurious tokens. For example, ‘Finland’s capital’ could become [Finland, s, capital], and ‘what’re’ could become [what, re]. We therefore need to decide on specific conventions for these cases, depending on the objective.
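
As a hedged illustration, NLTK's word_tokenize (assuming NLTK and its tokeniser data are installed) keeps the clitic as a separate token rather than producing a bare ‘s’:

```python
# Requires: pip install nltk ; nltk.download('punkt')
from nltk.tokenize import word_tokenize

print(word_tokenize("Finland's capital"))   # ['Finland', "'s", 'capital']
print(word_tokenize("what're you doing"))   # ['what', "'re", 'you', 'doing']
```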

Certain languages, such as Mandarin Chinese, are written without spaces between words. To deal with this, we can use max-match segmentation: starting at the beginning of the string, we greedily take the longest word in the dictionary that matches at the current position, split it off, and repeat from the next character. For example, for ‘Thecatinthehat’ we start at ‘T’, find the longest dictionary word starting there (‘The’), split it off, and continue with ‘cat’, ‘in’, and so on. However, this method generally does not work well in English: ‘Thetabledownthere’ would be segmented as ‘theta bled own there’ rather than ‘the table down there’.
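
A sketch of the max-match idea, assuming a small hand-made dictionary (the word list below is invented just for this example):

```python
def max_match(text, dictionary):
    """Greedy left-to-right segmentation: always take the longest dictionary
    word starting at the current position (fall back to a single character)."""
    words = []
    i = 0
    while i < len(text):
        match = text[i]                      # fallback: single character
        for j in range(len(text), i, -1):    # try longest candidates first
            if text[i:j].lower() in dictionary:
                match = text[i:j]
                break
        words.append(match)
        i += len(match)
    return words

dictionary = {"the", "cat", "in", "hat", "theta", "table", "bled", "own", "down", "there"}
print(max_match("Thecatinthehat", dictionary))      # ['The', 'cat', 'in', 'the', 'hat']
print(max_match("Thetabledownthere", dictionary))   # ['Theta', 'bled', 'own', 'there']
```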

Normalisation and Stemming

Normalisation means different things depending on the context. For example:

Information Retrieval (IR) : indexed text and query terms must have the same form; for example, we want to match U.S.A. and USA. Normalisation, in this context, means we implicitly define equivalence classes of terms (e.g. by deleting periods in a term).

Asymmetric Expansion : if the user enters ‘window’, the system might search for ‘window’ or ‘windows’; if the user enters ‘windows’, it might search for ‘Windows’, ‘windows’ or ‘window’, and so on.

For applications like IR, we tend to reduce all letters to lower case, except upper case words in mid-sentence, since it can be important to distinguish ‘Fed’ from ‘fed’. For tasks such as sentiment analysis, preserving case is extremely useful (it is important to distinguish ‘US’ from ‘us’).
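
A minimal sketch of equivalence classing by deleting periods and optionally folding case; the rule set here is an assumption for illustration, not a standard:

```python
def normalise_term(term, fold_case=True):
    """Collapse terms like 'U.S.A.' and 'USA' into one equivalence class
    by deleting periods, optionally lower-casing as well."""
    term = term.replace(".", "")
    return term.lower() if fold_case else term

print(normalise_term("U.S.A.") == normalise_term("USA"))   # True
print(normalise_term("US"), normalise_term("us"))          # both 'us' -- the case distinction is lost
```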

Lemmatisation

Lemmatisation reduces inflections or variant forms to the base form. For example, ‘the boy’s cars are different colours’ would be reduced to ‘the boy car be different colour’. In general, lemmatisation means finding the correct dictionary headword form for a given wordform.
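
A hedged example using NLTK's WordNet lemmatiser (assuming NLTK and the WordNet data are installed); note that it needs the part of speech to do a good job:

```python
# Requires: pip install nltk ; nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("cars", pos="n"))    # 'car'
print(lemmatizer.lemmatize("geese", pos="n"))   # 'goose'
print(lemmatizer.lemmatize("are", pos="v"))     # 'be'
```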

Morphology

Morphemes are the small meaningful units that make up words. Stems are the core meaning-bearing units, and affixes are the bits and pieces that attach to stems (often with grammatical functions). For example, in the word ‘stems’, ‘stem’ is the stem and ‘-s’ is the affix.

Stemming

In IR, stemming simply means reducing terms to their stems by removing affixes. Stemming is a crude chopping of affixes and it is language dependent. For example, ‘automates’, ‘automatic’ and ‘automation’ are all reduced to ‘automat’. The drawback of stemming is that it does not always return a real word. The simplest and most commonly used algorithm is Porter’s algorithm (an English stemmer). The algorithm applies a series of predefined rewrite rules. The first step chops off suffixes that match those rules: for example, words ending in ‘sses’ are reduced to end in ‘ss’, and the suffixes ‘ing’ and ‘ed’ are removed, but only when the remaining stem contains a vowel. Removing ‘ing’ from ‘sing’ would leave ‘s’, which has no vowel, so ‘sing’ is left alone, whereas ‘walking’ becomes ‘walk’. The second step performs longer replacements, for example turning ‘ational’ into ‘ate’, so ‘relational’ becomes ‘relate’. The third step involves even longer suffixes.
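
A hedged sketch using NLTK's implementation of the Porter stemmer (assuming NLTK is installed); exact outputs can differ slightly between Porter variants, but the first-step behaviour described above is easy to observe:

```python
# Requires: pip install nltk
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["caresses", "walking", "sing", "relational", "automation"]:
    print(word, "->", stemmer.stem(word))
# e.g. 'caresses' -> 'caress', 'walking' -> 'walk', 'sing' -> 'sing'
```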

Sentence Segmentation

The period “.” is quite ambiguous: it can mark a sentence boundary, an abbreviation or a number. We can solve this problem by building a binary classifier that looks at each period “.” and decides whether it is EndOfSentence or NotEndOfSentence. The classifier can be built from hand-written rules, regular expressions or machine learning. The simplest classifier for this is a decision tree.

You can develop more sophisticated decision tree features. For example, you can look at the case of the word containing or following the “.” (upper, lower, capitalised, number). You can also use numeric features, for example the length of the word containing the “.”, or the probability that the word containing or following the “.” occurs at the end of a sentence. A decision tree is just a nested if-then-else statement, and the interesting research lies in choosing the features. In general, setting up the tree structure by hand is too hard (it is only feasible with simple features and domains), and for numeric features it is too hard to pick each threshold by hand. Therefore, we usually use machine learning to learn the structure from a training corpus.
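
A toy sketch of such a classifier written as a hand-coded if-then-else; the feature choices and the abbreviation list are assumptions made purely for illustration:

```python
ABBREVIATIONS = {"dr", "mr", "mrs", "etc", "inc"}   # toy abbreviation lexicon

def is_end_of_sentence(word_with_period, next_word):
    """Decide EndOfSentence / NotEndOfSentence for the period ending
    `word_with_period`, using simple case and lexicon features."""
    prefix = word_with_period.rstrip(".").lower()
    if prefix in ABBREVIATIONS:
        return False                  # known abbreviation
    if prefix.replace(".", "").isdigit():
        return False                  # number such as '3.'
    if next_word and next_word[0].isupper():
        return True                   # next word is capitalised
    return False

print(is_end_of_sentence("Dr.", "Smith"))    # False
print(is_end_of_sentence("home.", "Then"))   # True
```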
