Semester 2, 2021
Lecture 4, Part 4: Unstructured Data – Text Preprocessing

Pattern Matching in Text
Regular Expressions

Patterns in Text
• Scenario: you have a large collection of unstructured text data and need to write wrangling code in order to
• Check whether it contains any IP addresses (e.g. 128.250.65.5)
• Find all of the IP addresses
• Requirements
• Do it succinctly
• Do it unambiguously
• Have maintainable code
• Specify patterns in text – regular expressions
• Good for calculating statistics (counting occurrences of items in text)
• Checking for integrity, filtering, substitutions …
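As a rough sketch of the scenario above, both tasks can be written with Python's re module. The sample text and the names ip_pattern, contains_ip and all_ips are made up for illustration. The pattern is deliberately loose: it accepts any four dot-separated groups of one to three digits and does not check that each number is at most 255.

```python
import re

# Hypothetical sample text for illustration only.
text = "Server 128.250.65.5 replied; gateway 10.0.0.1 did not."

# Loose IP-like pattern: four groups of 1-3 digits separated by literal dots.
ip_pattern = r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}"

# Task 1: does the text contain any IP address?
contains_ip = re.search(ip_pattern, text) is not None

# Task 2: find all of the IP addresses.
all_ips = re.findall(ip_pattern, text)

print(contains_ip)
print(all_ips)
```

The same pattern serves both tasks, which is what keeps the code succinct and maintainable.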

Regular Expressions (re)
Simple match – characters match themselves exactly
• The pattern hello will match the string hello
• Hello will match Hello
RegExp/regex/re define patterns
• Metacharacters – special rules for matching

RE: metacharacters
.: matches any character
– For example, a.c matches a/c, abc, a5c
– To match ‘.’ as a literal, escape it with ‘\’: a\.c matches a.c
That is, ‘\’ is also a metacharacter
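A small sketch of the difference between the unescaped and the escaped dot. The list candidates and the variable names are illustrative only. Note that, by default, ‘.’ in Python does not match a newline character.

```python
import re

candidates = ["a/c", "abc", "a5c", "a.c", "ac"]

# Unescaped dot: exactly one arbitrary character between 'a' and 'c'.
any_char = [s for s in candidates if re.fullmatch(r"a.c", s)]

# Escaped dot: only a literal '.' between 'a' and 'c'.
literal_dot = [s for s in candidates if re.fullmatch(r"a\.c", s)]

print(any_char)
print(literal_dot)
```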

RE: metacharacters
\: the backslash character is used to
– escape metacharacters or other special characters, e.g.:
– match ‘.’ as a literal: a\.c matches a.c
– match ‘\’ as a literal: a\\c matches a\c
It also indicates special forms, e.g. the special character set \d for any decimal digit
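In Python it is usually easiest to write patterns as raw strings (r"..."), so that the backslash reaches the regex engine unchanged. The following sketch, with made-up variable names, matches a literal backslash and uses the special form \d.

```python
import re

# A three-character string: 'a', a backslash, 'c'.
subject = r"a\c"

# In the pattern, the backslash itself must be escaped.
backslash_pattern = r"a\\c"
matched = re.fullmatch(backslash_pattern, subject) is not None

# \d is a special form: any decimal digit.
digits = re.findall(r"\d", "room 42b")

print(len(subject), matched, digits)
```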

RE: metacharacters
[ ]: matches a set (class) of characters; examples: [abc], [a-zA-Z]
• [^ ]: complements the set
add ‘^’ as the first character in the class ([^z] matches anything but z)
What does the pattern [z^] match?
• Use ‘\’ to escape the special characters ‘[’ and ‘]’ inside [ ]: [\[] matches the special character ‘[’
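The character-class rules above can be tried out with a short sketch like the following. The test strings and variable names are invented for illustration; running it is one way to check your answer to the [z^] question.

```python
import re

# A set of characters: each of a, b or c.
in_set = re.findall(r"[abc]", "cabbage")

# A range inside a class: any ASCII letter.
letters = re.findall(r"[a-zA-Z]", "R2-D2")

# '^' first in the class complements it: anything but z.
not_z = re.findall(r"[^z]", "zaz")

# '^' that is not first in the class is an ordinary character.
z_or_caret = re.findall(r"[z^]", "z^a")

# Escaped square brackets inside a class.
brackets = re.findall(r"[\[\]]", "x[1]")

print(in_set, letters, not_z, z_or_caret, brackets)
```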

RE: metacharacters
[ ]: avoiding the use of square brackets
• Predefined special character set: e.g.,
• \d any decimal digit == [0-9]
• \w any alphanumeric character or underscore == [a-zA-Z0-9_]
• \W any non-alphanumeric character == [^a-zA-Z0-9_]
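A quick sketch of the predefined sets on an ASCII sample string (the string and variable names are illustrative). One caveat for Python 3: on str patterns, \d and \w also match non-ASCII digits and letters unless the re.ASCII flag is passed, so the equivalences above hold exactly only for ASCII text.

```python
import re

sample = "ID_7: ok!"

digits = re.findall(r"\d", sample)
word_chars = re.findall(r"\w", sample)
non_word = re.findall(r"\W", sample)

# For ASCII input, \w gives the same result as the explicit class.
same_as_class = word_chars == re.findall(r"[a-zA-Z0-9_]", sample)

print(digits, word_chars, non_word, same_as_class)
```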

RE: metacharacters
| “OR” operator
• Alternatives
• Given two patterns P1 and P2, P1 | P2 will match P1 or P2
• Create a regex that produces the same result without using ‘|’
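A minimal sketch of the ‘|’ operator, using an invented sentence. The last two lines hint at the exercise above for the special case where every alternative is a single character; longer alternatives need a different approach.

```python
import re

# P1 | P2: match either 'cat' or 'dog'.
animals = re.findall(r"cat|dog", "The cat chased a dog and a cow.")

# For single-character alternatives, a character class gives the same result.
with_or = re.findall(r"a|e", "gate")
with_class = re.findall(r"[ae]", "gate")

print(animals, with_or, with_class)
```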

RE: metacharacters
^ $: Anchoring
• ^ : start of string
• ^from will match from only at the start of the string, e.g. ‘from a to b’
• ^from will not match ‘I am from Melbourne’
• $ : end of string
• To match ‘^’ or ‘$’ as a literal
• Escape with ‘\^’ or ‘\$’
• Put it in a character class [$^] (note the special meaning if ‘^’ is placed as the first character)
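The anchoring examples above can be checked with a sketch like this one. It uses re.search, which scans the whole string, so the anchors do the restricting; the variable names and the two strings containing ‘$’ and ‘^’ are made up.

```python
import re

# '^' anchors the match to the start of the string.
at_start = re.search(r"^from", "from a to b") is not None
not_at_start = re.search(r"^from", "I am from Melbourne") is not None

# '$' anchors the match to the end of the string.
at_end = re.search(r"b$", "from a to b") is not None

# Matching the anchors as literals: escape them, or use a character class.
escaped_dollar = re.findall(r"\$", "cost: $5")
in_class = re.findall(r"[$^]", "2^3 = $8")

print(at_start, not_at_start, at_end, escaped_dollar, in_class)
```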

RE: metacharacters
* + ? {m,n}: repeat a pattern
• * : zero or more repetitions
• + : one or more repetitions
• ? : zero or one repetition
• {m,n}: at least m and at most n repetitions
• Pattern search is greedy: a repeated pattern matches as many repetitions as it can
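The repetition operators and greedy matching can be illustrated with the sketch below. The test strings are invented. The non-greedy form .*? in the last line is a Python feature that goes beyond the list above and is shown only for contrast.

```python
import re

tests = ["ac", "abc", "abbbc"]

# * : zero or more 'b'
star = [s for s in tests if re.fullmatch(r"ab*c", s)]
# + : one or more 'b'
plus = [s for s in tests if re.fullmatch(r"ab+c", s)]
# ? : zero or one 'b'
optional = [s for s in tests if re.fullmatch(r"ab?c", s)]
# {2,3} : at least two and at most three 'b'
bounded = [s for s in ["abc", "abbc", "abbbc", "abbbbc"]
           if re.fullmatch(r"ab{2,3}c", s)]

# Greedy: .* grabs as much as it can, up to the last '>'.
greedy = re.search(r"<.*>", "<b>bold</b>").group()
# Non-greedy variant (not covered above): stops at the first '>'.
lazy = re.search(r"<.*?>", "<b>bold</b>").group()

print(star, plus, optional, bounded, greedy, lazy)
```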

More complex regular expression
What do you think this pattern is for?
• As a task to do for the next live Zoom, please improve this pattern!

RE: metacharacters substitution & capturing groups
( ): as in math notation, group patterns with the metacharacters ( )
• Grouped patterns are captured and numbered
• You can refer to the captured contents with back-references (\1, \2, …)
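A short sketch of capturing groups and back-references in Python. The sample strings and variable names are illustrative. A back-reference can appear inside the pattern itself (to require a repeat of what a group matched) or in the replacement string of re.sub.

```python
import re

# Groups are numbered from 1, left to right.
m = re.search(r"(\d+)-(\d+)", "pages 12-34")
first, second = m.group(1), m.group(2)

# Back-reference inside the pattern: the same word twice in a row.
repeated = re.search(r"(\w+) \1", "it is is fine").group()

# Back-references in a substitution: swap the two captured numbers.
swapped = re.sub(r"(\d+)-(\d+)", r"\2-\1", "pages 12-34")

print(first, second, repeated, swapped)
```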

RE: metacharacters
The complete list of metacharacters
. ^ $ * + ? { } [ ] \ | ( )

ELIZA and Doctor
ELIZA: a computer psychotherapist
• I’m (depressed|sad|unhappy)
• Why do you say you are \1
• Is it because \1 that you come to me?
It is now called Doctor in Emacs
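A toy sketch of one ELIZA-style rule using re.sub and a back-reference. The function name eliza_reply is hypothetical, the code uses a plain ASCII apostrophe in I'm, and a real ELIZA has many more rules than this single substitution.

```python
import re

def eliza_reply(utterance):
    """Hypothetical helper: turn a statement of feeling into a question."""
    pattern = r"I'm (depressed|sad|unhappy)"
    # \1 inserts whichever alternative the group captured.
    return re.sub(pattern, r"Why do you say you are \1?", utterance)

reply = eliza_reply("I'm sad")
unchanged = eliza_reply("Nice weather today")

print(reply)
print(unchanged)
```

When the pattern does not match, re.sub returns the input unchanged, so a fuller version would need a fallback response.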

Regular Expressions in Python
python re
import re
re.match()
re.search()
re.sub()
re.split()
p = re.compile(pattern)
p.match()
Practice in the workshop
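As a warm-up for the workshop, here is a minimal sketch of the functions listed above. The sample text and variable names are invented. The key difference to notice is that re.match only tries the start of the string, while re.search scans the whole string.

```python
import re

text = "alpha, beta;gamma"

# re.match: succeeds only if the pattern matches at the start.
match_result = re.match(r"beta", text)

# re.search: finds the first match anywhere in the string.
search_start = re.search(r"beta", text).start()

# re.sub: replace every separator (',' or ';' plus an optional space).
replaced = re.sub(r"[,;] ?", " | ", text)

# re.split: split the string on the same separators.
parts = re.split(r"[,;] ?", text)

# re.compile: build the pattern object once, then reuse its methods.
p = re.compile(r"[a-z]+")
first_word = p.match(text).group()
all_words = p.findall(text)

print(match_result, search_start, replaced, parts, first_word, all_words)
```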
