Semester 2, 2021
Lecture 4, Part 5: Unstructured Data – Preprocessing

Text Preprocessing – Tokenisation
• Granularity of a token
– Sentence
– Word
• Token separators
– “The speaker did her Ph.D. in Germany. She now works at UniMelb.”
– “The issue—and there are many—is that text is not consistent.”

Text Preprocessing – Tokenisation
• Split continuous text into a list of individual tokens
• English words are often, but not always, separated by whitespace
• Tokens can be words, numbers, hashtags, etc.
• Can use regular expressions
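
As a rough illustration of regex-based tokenisation, here is a minimal Python sketch. The pattern `TOKEN_PATTERN` and the function name `tokenise` are assumptions made for this example, not something prescribed by the lecture, and the pattern is only one of many reasonable choices.

```python
import re

# Hypothetical pattern, tried left to right at each position:
#   hashtags, decimal numbers, words (allowing internal apostrophes or
#   hyphens), and finally any single non-space symbol such as punctuation.
TOKEN_PATTERN = re.compile(r"#\w+|\d+\.\d+|\w+(?:['’-]\w+)*|[^\w\s]")

def tokenise(text):
    """Split text into a list of tokens using TOKEN_PATTERN."""
    return TOKEN_PATTERN.findall(text)

print(tokenise("Coffee is #cheap, only 2.5 bucks!"))
print(tokenise("The speaker did her Ph.D. in Germany."))
```

With this pattern "Ph.D." is split into four tokens ("Ph", ".", "D", "."), which shows why the choice of token separators matters: a full stop is not always a sentence boundary.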

Text Preprocessing – Case folding
• Convert text to a consistent case (usually lower case)
• Simple and effective for many tasks
• Reduces sparsity (many surface forms map to the same lower-case form)
• Good for search
I had an AMAZING trip to Italy, Coffee is only 2 bucks, sometimes three!
i had an amazing trip to italy, coffee is only 2 bucks, sometimes three!
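
The example above can be reproduced with Python's built-in `str.lower()`. This is a minimal sketch; the function name `fold_case` is made up for illustration.

```python
def fold_case(text):
    """Return the text with every cased character converted to lower case."""
    return text.lower()

original = "I had an AMAZING trip to Italy, Coffee is only 2 bucks, sometimes three!"
print(fold_case(original))

# Sparsity: three surface forms collapse into a single vocabulary entry.
variants = ["Coffee", "coffee", "COFFEE"]
print({fold_case(v) for v in variants})
```

The cost is that case distinctions which carry meaning (for example a proper noun versus a common noun) are lost, so whether to fold case depends on the task.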

Preprocessing – Stemming
• Words in English are derived from a root or stem
inexpensive → in+expense+ive
• Stemming attempts to undo the processes that lead to word formation
• Remove and replace word suffixes to arrive at a common root form
• Result does not necessarily look like a proper ‘word’
• Porter stemmer: one of the most widely used stemming algorithms
• Suffix stripping (Porter stemmer)
– sses → ss
– ies → i
– tional → tion
– tion → t
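
To make suffix stripping concrete, the sketch below applies the four rewrite rules listed above, in order, to the end of a word. It is a simplification for illustration and not the Porter stemmer itself: the real algorithm runs its rules in several steps and only fires a rule when conditions on the remaining stem are met. The names `SUFFIX_RULES` and `strip_suffix` are hypothetical.

```python
# Rewrite rules from the slide, tried in order (longer suffixes first so
# that "tional" is tested before "tion").
SUFFIX_RULES = [
    ("sses", "ss"),
    ("ies", "i"),
    ("tional", "tion"),
    ("tion", "t"),
]

def strip_suffix(word):
    """Apply the first matching suffix rule; return the word unchanged otherwise."""
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix):
            return word[: -len(suffix)] + replacement
    return word

for w in ["caresses", "ponies", "conditional", "adoption", "cat"]:
    print(w, "->", strip_suffix(w))
```

Note that "ponies" becomes "poni", which is not a dictionary word. That is acceptable for stemming, since the goal is only that related forms end up with the same stem.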

Preprocessing – Lemmatization
• Removes inflections and maps a word to its proper root form (lemma)
• It does not just strip suffixes; it transforms words to valid roots:
running → run
runs → run
ran → run
• Python NLTK provides the WordNet Lemmatizer, which uses the WordNet database to look up lemmas of words.
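
The idea of dictionary-based lemmatization can be sketched without any external data. The example below swaps in a tiny hand-written lookup table for WordNet, so it is not the NLTK lemmatizer mentioned above; `LEMMA_TABLE` and `lemmatize` are hypothetical names and the table covers only the three examples from the slide.

```python
# Hypothetical hand-written lemma table standing in for a real lexical
# database such as WordNet.
LEMMA_TABLE = {
    "running": "run",
    "runs": "run",
    "ran": "run",
}

def lemmatize(word):
    """Look up the lemma of a word; fall back to the lower-cased word itself."""
    key = word.lower()
    return LEMMA_TABLE.get(key, key)

for w in ["running", "runs", "ran", "Germany"]:
    print(w, "->", lemmatize(w))
```

The irregular form "ran" shows the difference from stemming: no suffix rule turns "ran" into "run", but a lookup does.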

Stop Word Removal
• Stop words are ‘function’ words that structure sentences; they are low information words and some of them are very common
• ‘the’, ‘a’, ‘is’,…
• Exclude them from being processed; helps to reduce the number of features/words
• Commonly applied for search, text classification, topic modelling, topic extraction, etc.
• A stopword list can be custom-made for a specific context/domain
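
A minimal sketch of stop word removal over an already tokenised text follows. The stop word list here is deliberately tiny and invented for the example; real lists are much longer, and the optional `stop_words` parameter shows how a custom, domain-specific list could be passed in.

```python
# Hypothetical, very small stop word list for illustration only.
STOP_WORDS = {"the", "a", "an", "is", "are", "of", "to", "and"}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep only tokens whose lower-cased form is not in the stop word list."""
    return [t for t in tokens if t.lower() not in stop_words]

tokens = "The coffee is a bargain".split()
print(remove_stop_words(tokens))

# A custom list for a specific domain, e.g. coffee reviews where the word
# "coffee" appears in nearly every document and carries little information.
print(remove_stop_words(tokens, STOP_WORDS | {"coffee"}))
```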

Text Normalisation
• Transforming a text into a canonical (standard) form
• Important for noisy text, e.g., social media comments, text messages
• Used when there are many abbreviations, misspellings and out-of-vocabulary (OOV) words
• E.g.
2moro → tomorrow
2mrw → tomorrow
tomrw → tomorrow
B4 → before
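
One simple way to implement this kind of normalisation is a lookup table from noisy forms to canonical forms. The sketch below assumes the text is already tokenised and uses only the four mappings shown above; `NORMALISATION_TABLE` and `normalise` are hypothetical names, and a practical system would need a far larger table or a spelling-correction model.

```python
# Mappings from the slide, keyed by lower-cased noisy form.
NORMALISATION_TABLE = {
    "2moro": "tomorrow",
    "2mrw": "tomorrow",
    "tomrw": "tomorrow",
    "b4": "before",
}

def normalise(tokens):
    """Replace each known noisy token with its canonical form."""
    return [NORMALISATION_TABLE.get(t.lower(), t) for t in tokens]

print(normalise("see you 2moro B4 lunch".split()))
```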

Noise Removal
• Remove unnecessary spacing
• Remove punctuation and special characters (regular expressions)
• Unify numbers
• Highly domain dependent
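
The three steps above can be sketched with regular expressions as follows. The placeholder `NUM`, the function name `remove_noise`, and the exact patterns are assumptions for this example; what counts as noise depends on the domain, so the patterns would normally be adapted.

```python
import re

def remove_noise(text, number_token="NUM"):
    """Unify numbers, drop punctuation and special characters, tidy spacing."""
    # Unify numbers: 2, 2.50 and 1,000 all become the same placeholder.
    text = re.sub(r"\d+(?:[.,]\d+)*", number_token, text)
    # Replace anything that is not a word character or whitespace with a space.
    text = re.sub(r"[^\w\s]", " ", text)
    # Collapse runs of whitespace and trim both ends.
    return re.sub(r"\s+", " ", text).strip()

print(remove_noise("Coffee   costs $2.50!!  (was 3)"))
```

Numbers are replaced before punctuation is removed so that "2.50" is treated as one number rather than being split at the decimal point.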

So far… Unstructured Text Data
• Text search – approximate string matching
• Preprocessing
– Regular expressions
– Tokenisation
– Case folding
– Stemming
– Lemmatization
– Stop word removal
– Text normalization
– Noise removal
• Document representation and text features (BoW, TF-IDF)
• Crawling & scraping
