Regular Expressions

Symbols

Identifiers
\d – any digit
\D – anything but a digit
\s – whitespace
\S – anything but whitespace
\w – any word character (letter, digit or underscore)
\W – anything but a word character
. – any character, except for a new line
\b – word boundary (the position at the edge of a word)
\. – a period

Modifiers …
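A minimal sketch of a few of these identifiers, using Python's built-in `re` module. The sample string and variable names are made up for illustration.

```python
import re

# Hypothetical sample text, chosen only to exercise the identifiers
text = "Order 66 shipped on 2024-01-15. Total: 42 items."

# \d+ matches runs of digits
numbers = re.findall(r"\d+", text)

# \w+ matches runs of word characters (letters, digits, underscore)
words = re.findall(r"\w+", text)

# \b marks a word boundary, so this matches "items" only as a whole word
whole_word = re.findall(r"\bitems\b", text)

# \. matches a literal period; an unescaped . matches any character except a newline
periods = re.findall(r"\.", text)

# \s+ matches runs of whitespace, so splitting on it gives whitespace-separated chunks
chunks = re.split(r"\s+", text)

print(numbers)
print(whole_word)
print(len(periods), len(chunks))
```

Raw strings (`r"..."`) keep the backslashes intact, so the pattern reaches the regex engine exactly as written.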

Natural Language Processing with NLTK – Part 2

Text Classification

In [26]: import nltk
         import random
         from nltk.corpus import movie_reviews

In [27]: documents = [(list(movie_reviews.words(fileid)), category)
                      for category in movie_reviews.categories()
                      for fileid in movie_reviews.fileids(category)]

Shuffle the documents, because they are ordered by category.

In [28]: random.shuffle(documents)

Lowercase all words and convert the list to a frequency distribution.

In [29]: all_words = []

In [30]: for w …
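The shuffle, lowercase and frequency-count steps can be sketched without downloading the corpus. This is only an illustration under stated assumptions: the two toy documents are invented, and `collections.Counter` stands in for `nltk.FreqDist` (which is itself a `Counter` subclass).

```python
import random
from collections import Counter

# Hypothetical toy data in the same (word_list, category) shape as `documents`
documents = [
    (["A", "great", "film", "with", "a", "great", "cast"], "pos"),
    (["A", "dull", "plot", "and", "a", "weak", "script"], "neg"),
]

random.seed(0)  # fixed seed so the shuffle is repeatable
random.shuffle(documents)

# Lowercase every word across all documents
all_words = [w.lower() for words, _category in documents for w in words]

# Counter stands in for nltk.FreqDist here (an assumption made for portability)
word_freq = Counter(all_words)
top_word, top_count = word_freq.most_common(1)[0]

print(top_word, top_count)
```

Shuffling matters once the documents are split into training and test sets. Without it, one category would dominate each slice.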

Natural Language Processing with NLTK – Part 1

Preparation

In [1]: import nltk

Download all the packages:

In [2]: #nltk.download()

Tokenisation

Two types:
Word tokenisers – separate by words
Sentence tokenisers – separate by sentences

Terminology

Corpus (plural: corpora) – body of text (e.g. medical journals, presidential speeches)
Lexicon – words and their meanings (e.g. investor-speak dictionary vs regular English-speak dictionary, i.e. …
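To make the difference between the two tokeniser types concrete, here is a rough sketch using only Python's built-in `re` module. It is a crude stand-in for NLTK's tokenisers, not their actual implementation, and the sample text is made up.

```python
import re

# Hypothetical sample text
text = "NLTK is a Python library. It ships with several tokenisers! Does this work?"

# Sentence tokeniser sketch: split on whitespace that follows ., ! or ?
sentences = re.split(r"(?<=[.!?])\s+", text)

# Word tokeniser sketch: runs of word characters, or single punctuation marks
words = re.findall(r"\w+|[^\w\s]", text)

print(sentences)
print(words)
```

A simple rule like this breaks on abbreviations such as "Dr." or "e.g.", which is one reason to prefer a trained sentence tokeniser in practice.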

Basic web scraping with BeautifulSoup4

Introduction

BeautifulSoup is a Python library for pulling data out of HTML and XML files. It provides idiomatic ways of navigating, searching, and modifying the parse tree. The library is useful for scraping websites and extracting information. For example, you can use BeautifulSoup to extract reviews from Amazon, to gauge …
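As a small illustration of navigating and searching the parse tree, the sketch below parses an inline HTML string rather than a live site. The markup, the class names `review` and `rating`, and the variable names are all hypothetical; a real page will be structured differently.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML snippet standing in for a downloaded review page
html = """
<html><body>
  <div class="review"><span class="rating">5</span><p>Works great.</p></div>
  <div class="review"><span class="rating">2</span><p>Stopped working.</p></div>
</body></html>
"""

# "html.parser" is Python's built-in parser, so no extra dependency is needed
soup = BeautifulSoup(html, "html.parser")

reviews = []
for div in soup.find_all("div", class_="review"):
    # Search inside each review block for its rating and text
    rating = int(div.find("span", class_="rating").get_text())
    body = div.find("p").get_text(strip=True)
    reviews.append((rating, body))

print(reviews)
```

When scraping a real site, the HTML would come from an HTTP response instead of a string literal, and the selectors would need to match that site's markup.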

Introduction!

Objective

I recently completed an internship at Mime Consulting, a data and technology consultancy start-up that specialises in the education sector, where I was introduced to VBA and SQL. Currently, I am aiming to develop a strong technical skill set in Python (and machine learning) and one other programming language (probably …