Introduction to Regular Expressions

Faculty of Information Technology
Monash University

FIT5196 week 2

(Monash) FIT5196 1 / 23

Regular Expressions

A regular expression is a sequence of symbols that describes a text pattern.
• \d{4}-\d{2}-\d{2}
• wrangling

Why regular expressions?
• Regular expressions are useful for finding, replacing and extracting information
from text, such as log files, HTML/XML files, and other documents

− Search a document for color or neighbor with or without ’u’
− Convert a tab-delimited file to a comma-delimited file
− Find duplicated words in a text
− Search and replace “Bob” and “Bobby” with “Robert”

• Regular expressions are useful for verifying whether input fits a text
pattern, such as verifying

− phone numbers: Does a phone number have the right number of digits?
− emails: Is an email address in a valid format?
− dates: Is a date in the right format? Does the month exceed 12?
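
The four find-and-replace tasks listed above can be sketched in Python with the standard re module. This is only an illustration: the sample strings are made up, and the \b word boundary and the \1 backreference go slightly beyond the syntax covered in these slides.

```python
import re

# 1. "color"/"colour" and "neighbor"/"neighbour": the "u" is optional.
spelling = re.compile(r"colou?r|neighbou?r")
found = spelling.findall("My neighbour likes the color red; my neighbor likes the colour blue.")

# 2. Convert a tab-delimited line to a comma-delimited one.
csv_line = re.sub(r"\t", ",", "id\tname\tscore")

# 3. Find duplicated words: \1 refers back to the word captured by (\w+).
duplicates = re.findall(r"\b(\w+)\s+\1\b", "this is is a test of the the parser")

# 4. Replace "Bob" and "Bobby" with "Robert"; \b leaves "Bobcat" untouched.
renamed = re.sub(r"\bBob(by)?\b", "Robert", "Bob met Bobby near the Bobcat.")

print(found, csv_line, duplicates, renamed, sep="\n")
```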

Regular Expressions: validate emails

r"(^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$)" [1]

Figure: railroad diagram of the pattern (group #1, from start of line to end of line): one or more of a-z, A-Z, 0-9, “_”, “.”, “+”, “-”; then “@”; then one or more of a-z, A-Z, 0-9, “-”; then “.”; then one or more of a-z, A-Z, 0-9, “-”, “.”. Generated by https://regexper.com/

[1] http://emailregex.com/
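
A minimal sketch of how the pattern above might be used in Python. The helper name looks_like_email and the sample addresses are hypothetical, and the check covers the format only, not whether the address exists.

```python
import re

# Pattern from the slide (source: emailregex.com), written as a raw string.
EMAIL_RE = re.compile(r"(^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$)")

def looks_like_email(text):
    """Return True if text fits the slide's email pattern (format check only)."""
    return EMAIL_RE.match(text) is not None

# Hypothetical sample inputs.
samples = ["jane.doe@example.com", "jane.doe@example", "@example.com", "a+b@mail.example.org"]
results = {s: looks_like_email(s) for s in samples}
print(results)
```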

Outline

1 Regular Expression Syntax
character sets
repetition
grouping
raw string in Python

2 Case studies

3 Summary

Regular Expression Syntax

Matching String Literals

The most obvious feature of regular expressions is matching strings with one or
more literal characters, called string literals.

Everything is essentially a character in regular expressions.
• cat matches “cat”.
• cat matches the first three characters of “cattle” and “catfish”.

This is similar to searching in a word-processing program.
Matching is case-sensitive:
• cat does not match “Cat”.

How does a regular expression engine work? It scans the text from left to right, trying the pattern at each position.

cat
The cow, camel and cat communicated.
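
A short sketch of the scan on the sentence above, assuming Python's re module. Note that the literal cat is found twice: once as a word and once inside “communicated”.

```python
import re

text = "The cow, camel and cat communicated."

# re.search returns the first (leftmost) match; re.finditer yields every match.
first = re.search(r"cat", text)
spans = [m.span() for m in re.finditer(r"cat", text)]

print(first.span())                       # position of the first match
print(spans)                              # all (start, end) positions
print(re.search(r"cat", "Cat") is None)   # matching is case-sensitive
```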

Regular Expression Syntax character sets

Character sets: [...]

Assume that we are going to match the following two words:

grey gray

What should the regular expression be?
[...] indicates a set of characters
• Matches any one of the characters in the set, but only one
• The order of characters does not matter.
• The regular expression is gr[ea]y

Figure: railroad diagram of gr[ea]y: “gr”, then one of “e” or “a”, then “y”.

• gr[ea]y does not match grAy, graay, or graey.
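
The claims about gr[ea]y can be checked with a few lines of Python. This is a sketch; re.fullmatch is used so that the whole word has to fit the pattern.

```python
import re

pattern = re.compile(r"gr[ea]y")
words = ["grey", "gray", "grAy", "graay", "graey"]

# fullmatch requires the whole word to fit the pattern.
matches = {w: pattern.fullmatch(w) is not None for w in words}
print(matches)
```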

Character ranges: [a-zA-Z] and [0-9]

Assume that we are going to match Victorian car plate numbers, for example

XRA 000, 1AA 1AA

Note that the letters can be from A to Z, and the digits can be from 0 to 9. What
should the regular expression be?

Character ranges can be indicated
by giving two characters and
separating them with a “-”.
Example:
• [0-9]
• [a-z] or [A-Z]

Caution
• [50-99] does not mean all numbers from
50 to 99; it is the same as [0-9].

Figure: railroad diagram of [50-99]: one of “5”, “0”-“9”, “9”.

[A-Z0-9][A-Z][A-Z]\s[0-9][A-Z0-9][A-Z0-9]

Figure: railroad diagram of the pattern: one of A-Z or 0-9; one of A-Z; one of A-Z; white space; one of 0-9; one of A-Z or 0-9; one of A-Z or 0-9.
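
A sketch that tries the plate pattern in Python and also illustrates the [50-99] caution. The two rejected plates are made-up counterexamples, not from the slides.

```python
import re

# Plate pattern from the slide; \s is the whitespace between the two halves.
PLATE_RE = re.compile(r"[A-Z0-9][A-Z][A-Z]\s[0-9][A-Z0-9][A-Z0-9]")

plates = ["XRA 000", "1AA 1AA", "xra 000", "XRA000"]
plate_ok = {p: PLATE_RE.fullmatch(p) is not None for p in plates}

# [50-99] is a set holding "5", the range "0"-"9" and "9": it matches one digit.
single_digits = re.findall(r"[50-99]", "50 to 99")

print(plate_ok)
print(single_digits)
```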

Negative character sets: [^...]

Assume that we are going to write a regular expression that matches only the live
animals

hog dog bog

Question: what is the regular expression?
[^...]: If the first character of the set is ^, any character that is not in
the set will be matched.
• [^b]og matches “hog” and “dog”, but not “bog”.
• Caution: a negative set still has to match one character.

− Does see[^mn] match “see”?
− Does see[^mn] match “see ”?

Try the regular expression in Pythex (http://pythex.org/)!
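
The two caution questions can also be answered with a small Python sketch (assuming the standard re module):

```python
import re

animal = re.compile(r"[^b]og")
live = [w for w in ["hog", "dog", "bog"] if animal.fullmatch(w)]

# [^mn] must consume one character that is neither "m" nor "n".
see = re.compile(r"see[^mn]")
no_trailing_char = see.search("see")    # nothing is left for [^mn] to consume
trailing_space = see.search("see ")     # the space counts as "not m or n"

print(live)
print(no_trailing_char)
print(repr(trailing_space.group()))
```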

Metacharacters inside character sets: [.+]

Assume that we are going to match the following two strings:

var(9), var[0]

Now, we need to match () and [ ]. How can we do that?
Metacharacters inside character sets are already escaped. In other words,
they lose their special meaning inside sets.
• Example:

− h[ai.u]t matches “hat”, “h.t”, but not “hot”

Exceptions: ], -, ^ and \ do need to be escaped.
• h[ai.u]t → h[ai\]u]t

var[(\[][0-9][)\]]

Figure: railroad diagram of the pattern: “var”, then one of “(” or “[”, then one of 0-9, then one of “)” or “]”.
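
A sketch of the answer in Python. The escaping shown here (both square brackets escaped inside the sets) is one safe way to write it, and the third sample, var{1}, is a made-up counterexample.

```python
import re

# "(" and ")" are literal inside a set; "[" and "]" are escaped to be safe.
VAR_RE = re.compile(r"var[(\[][0-9][)\]]")
var_hits = VAR_RE.findall("var(9), var[0], var{1}")

# "." is a literal dot inside a set, so "hot" is not matched.
hat_hits = [w for w in ["hat", "h.t", "hot"] if re.fullmatch(r"h[ai.u]t", w)]

print(var_hits)
print(hat_hits)
```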

Shorthand character sets

Shorthand   Meaning                                   Equivalent
\d          matches any decimal digit from 0 to 9     [0-9]
\w          matches any word character                [a-zA-Z0-9_]
\s          matches any whitespace character          [ \t\n\r\f\v]
\D          matches any non-digit character           [^0-9]
\W          matches any non-word character            [^a-zA-Z0-9_]
\S          matches any non-whitespace character      [^ \t\n\r\f\v]

Examples:
• \d\d\d\d matches four-digit numbers, such as “2018”, but not text.
• \w\w\w matches three word characters, such as “abc”, “123” and “d_b”
• \w\w\s\w matches “ab c” but not “a bc”.
• [\w]-[\w] matches two word characters separated by a hyphen.
• [^\d] is the same as [\D]
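
The examples above can be tried in Python as follows. This is a sketch with made-up sample strings; raw strings (r"...") keep the backslashes intact.

```python
import re

four_digits = re.findall(r"\d\d\d\d", "FIT5196 week 2, 2018")
three_word_chars = [s for s in ["abc", "123", "d_b", "a-b"] if re.fullmatch(r"\w\w\w", s)]
spaced = [s for s in ["ab c", "a bc"] if re.fullmatch(r"\w\w\s\w", s)]
hyphenated = re.findall(r"[\w]-[\w]", "x-y and 3-4 but not x - y")

print(four_digits, three_word_chars, spaced, hyphenated, sep="\n")
```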

Caution:
• Is [^\d\s] the same as [\D\S]?

− [^\d\s]: matches a character that is NEITHER a digit NOR a whitespace character

Figure: railroad diagram of [^\d\s]: none of digit, white space.

− [\D\S]: matches a character that is EITHER not a digit OR not a whitespace character, which is every character

Figure: railroad diagram of [\D\S]: one of non-digit, non-white space.

Try the regular expressions with the following sentence: “Data Wrangling S2
2018 week 2”
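
A sketch of that experiment in Python, using the sentence from the slide:

```python
import re

sentence = "Data Wrangling S2 2018 week 2"

neither = re.findall(r"[^\d\s]", sentence)   # digits and whitespace are excluded
either = re.findall(r"[\D\S]", sentence)     # every character qualifies

print("".join(neither))
print("".join(either) == sentence)
```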

Regular Expression Syntax repetition

Repetition Expressions: repetition meta-characters

Metacharacter   Meaning
*               Match 0 or more repetitions of the preceding regex
+               Match 1 or more repetitions of the preceding regex
?               Match 0 or 1 repetitions of the preceding regex

Examples:
• Assume we are going to match the following words

oops ooops ooooops oooooops

but not
ops

Which regular expression(s) should we use?
1 oo*ps
2 ooo*ps
3 oo+ps
4 oo?ps

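The regular expressions that we can use:

ooo*ps oo+ps

Both require at least two “o”s. oo*ps and oo?ps also match “ops”, and oo?ps allows at most two “o”s.

Try the regular expressions in Pythex!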
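
A sketch that tests the four candidate patterns against the word lists from the slide. The helper fits is a hypothetical name introduced for this illustration.

```python
import re

wanted = ["oops", "ooops", "ooooops", "oooooops"]
unwanted = ["ops"]
candidates = [r"oo*ps", r"ooo*ps", r"oo+ps", r"oo?ps"]

def fits(pattern):
    """True if pattern matches every wanted word and no unwanted word."""
    return (all(re.fullmatch(pattern, w) for w in wanted)
            and not any(re.fullmatch(pattern, w) for w in unwanted))

usable = [p for p in candidates if fits(p)]
print(usable)
```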

Repetition Expressions: quantified repetitions

{m,n}: matches from m to n repetitions of the preceding regular
expression.
• m (min) and n (max) are non-negative integers
• m can be 0 (Python’s re module also reads an omitted m as 0)
• n is optional

Three forms
• \d{2} matches numbers with exactly 2 digits.
• \d{2,4} matches numbers with 2 to 4 digits.
• \d{2,} matches numbers with at least 2 digits (no upper bound).

Write the quantifier without spaces: in Python’s re module, \d{2, 4} (with a space) is treated as literal text rather than as a quantifier.

Try the “oops” example in Pythex, but with {m,n}
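
A sketch of the three forms in Python. The sample text is made up, and the \b word boundaries are an addition not covered in the slides; they stop a quantifier from matching part of a longer number.

```python
import re

text = "7 42 516 2018 123456"

exactly_two = re.findall(r"\b\d{2}\b", text)
two_to_four = re.findall(r"\b\d{2,4}\b", text)
at_least_two = re.findall(r"\b\d{2,}\b", text)

# The "oops" example: "o" repeated at least twice, then "ps".
oops = [w for w in ["ops", "oops", "ooooops"] if re.fullmatch(r"o{2,}ps", w)]

print(exactly_two, two_to_four, at_least_two, oops, sep="\n")
```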

Suppose we are going to match the following
report_2018_09 assignment_2018_9
budget_18_08 assignment_18_7

but not
report_201809_39 assignment_8_9000
budget_2345678_08 assignment_000999_7

What is the regular expression?

\w+_\d{2,4}_\d{1,2}

Figure: railroad diagram of the pattern: one or more word characters, “_”, 2 to 4 digits, “_”, 1 or 2 digits.

Try the regular expression in Pythex
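
A sketch that runs the pattern over both lists from the slide. re.fullmatch is used here on the assumption that each name is validated as a whole string.

```python
import re

NAME_RE = re.compile(r"\w+_\d{2,4}_\d{1,2}")

accept = ["report_2018_09", "assignment_2018_9", "budget_18_08", "assignment_18_7"]
reject = ["report_201809_39", "assignment_8_9000", "budget_2345678_08", "assignment_000999_7"]

accepted = [s for s in accept if NAME_RE.fullmatch(s)]
rejected = [s for s in reject if not NAME_RE.fullmatch(s)]

print(accepted == accept, rejected == reject)
```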

Repetition Expressions: greedy vs. lazy regex

Greedy strategy:
• Match as much as possible before giving control to the next part of the regular
expression.

− Regular expressions try to match the longest possible string
• Example

.*\d+
number 516

Here .* takes “number 51” and leaves only “6” for \d+.

• Question: Given a string like

"data", "wrangling", "FIT5196, S2."

what is the match of the regular expression ".+", ".+"?
1 "data", "wrangling"
2 "wrangling", "FIT5196, S2."
3 "data", "wrangling", "FIT5196, S2."
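
A sketch of the greedy behaviour in Python. Capture groups are added to the first pattern only to show which part of the text each piece consumed.

```python
import re

# Greedy .* grabs as much as it can and leaves \d+ the minimum: one digit.
greedy_groups = re.search(r"(.*)(\d+)", "number 516").groups()

text = '"data", "wrangling", "FIT5196, S2."'
# Greedy .+ runs to the last place where the rest of the pattern still fits.
greedy_match = re.search(r'".+", ".+"', text).group()

print(greedy_groups)
print(greedy_match)
```

With the greedy quantifier the match is the whole string, which is option 3.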

Lazy strategy:
• Match as little as possible before giving control to the next part of the regular
expression

• Syntax
− *?
− +?
− ??
− {m,n}?

• Example:
.*?\d+

number 516

Here .*? takes only “number ” and \d+ takes “516”.

• Question: Given a string like

"data", "wrangling", "FIT5196, S2."

what is the match of the regular expression ".+?", ".+?"?
1 "data", "wrangling"
2 "wrangling", "FIT5196, S2."
3 "data", "wrangling", "FIT5196, S2."
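
The lazy counterpart, sketched in Python. The capture groups and the extra findall call are additions for illustration.

```python
import re

# Lazy .*? stops as early as possible, so \d+ receives all three digits.
lazy_groups = re.search(r"(.*?)(\d+)", "number 516").groups()

text = '"data", "wrangling", "FIT5196, S2."'
lazy_match = re.search(r'".+?", ".+?"', text).group()

# A lazy quantifier is a common way to pull out each quoted item separately.
all_quoted = re.findall(r'".+?"', text)

print(lazy_groups)
print(lazy_match)
print(all_quoted)
```

With the lazy quantifier the match stops at the first place the pattern fits, which is option 1.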

Regular Expression Syntax grouping

Grouping: (...)

(...) matches whatever regular expression is inside the parentheses, and
indicates the start and end of a group.
• Apply repetition operators to a group of regular expressions
• Makes regular expressions easier to read
• Capture groups for use in matching, replacing and extraction, i.e., the
contents of a group can be retrieved.

• Cannot be used inside a character set.

Examples:

• abc+ matches abc, abcc, abcccc

Figure: railroad diagram of abc+: “ab”, then “c” repeated one or more times.

• (abc)+ matches abc, abcabc,
abcabcabc

Figure: railroad diagram of (abc)+: group #1 “abc”, repeated one or more times.

• "Incident American Airlines Flight 11 involving a Boeing 767-223ER in 2001"
• Regular expression: Incident (.*) involving

Figure: railroad diagram of the pattern: “Incident ”, then group #1 (any character, zero or more times), then “ involving”.

• Try it with a Python script!
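
One possible version of such a script, as a sketch. It shows both uses of a group: repeating a whole sub-pattern and retrieving the captured text with group(1).

```python
import re

# abc+ repeats only the final "c"; (abc)+ repeats the whole group.
candidates = ["abc", "abccc", "abcabc"]
only_c = [s for s in candidates if re.fullmatch(r"abc+", s)]
whole_group = [s for s in candidates if re.fullmatch(r"(abc)+", s)]

sentence = "Incident American Airlines Flight 11 involving a Boeing 767-223ER in 2001"
m = re.search(r"Incident (.*) involving", sentence)
flight = m.group(1)   # contents of capture group #1

print(only_c, whole_group, flight, sep="\n")
```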

Alternation: |

“|” is an OR operator
• A|B will match any string that matches either A or B
• Ordered: the leftmost alternative takes precedence.
• Multiple patterns can be daisy-chained.
• Group alternation expressions to keep them distinct.

Examples:
• apple|orange matches “apple” and “orange”
• (apple|orange) juice matches “apple juice” and “orange juice”
• w(ei|ie)rd matches both “weird” and “wierd”.
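
A sketch of alternation in Python with made-up sample strings. The last line illustrates the “leftmost alternative takes precedence” rule.

```python
import re

fruits = re.findall(r"apple|orange", "apple pie, orange juice, pear tart")

# Without the group, "apple|orange juice" would mean "apple" OR "orange juice".
juices = [m.group() for m in re.finditer(r"(apple|orange) juice", "apple juice and orange juice")]

# Leftmost alternative wins: "Bob" is tried before "Bobby" and already succeeds.
leftmost = re.search(r"Bob|Bobby", "Bobby").group()

print(fruits, juices, leftmost, sep="\n")
```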

Regular Expression Syntax raw string in Python

The backslash plague:

The backslash either introduces special forms or allows special characters to be
used without invoking their special meaning. [2]

Characters      Stage
\section        Text string to be matched
\\section       Escaped backslash for re.compile()
"\\\\section"   Escaped backslashes for a Python string literal

So, to match a literal backslash, one has to write '\\\\' as the regular
expression string.

Can we simplify the expression?

[2] See https://docs.python.org/3/howto/regex.html

Raw string: r"..."

Raw strings suppress the special meaning of escape sequences: Python does not treat
the backslash as a special character in them.

Regular Python string literal   Raw string
"\\\\section"                   r"\\section"
"\\w+\\s+"                      r"\w+\s+"

Regular expressions will often be written in Python code using this raw
string notation.

Try the Python script!
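
A sketch of such a script. It checks that the two notations in the table produce the same pattern text and then uses the raw form to match a literal backslash. The sample strings are made up.

```python
import re

# Each pair of literals spells the same pattern text.
same_section = "\\\\section" == r"\\section"
same_words = "\\w+\\s+" == r"\w+\s+"

latex = r"\section{Introduction}"        # text containing one literal backslash
hit = re.search(r"\\section", latex)     # regex \\ matches a single backslash
words = re.findall(r"\w+\s+", "raw strings help")

print(same_section, same_words)
print(hit.group())
print(words)
```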

Case studies

Case study 1: validate dates

Date samples (day, month, year):
• 02/08/2018
• 2/8/2018
• 2/8/18
• 23/08/2018
• 23-08-2018

See the Jupyter notebook!
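
The notebook itself is not reproduced here. The following is only one possible pattern for the samples above, written under two assumptions: both separators must be the same character, and the year has 2 or 4 digits. It checks the format and the ranges 1-31 and 1-12, but not the number of days in a given month.

```python
import re

# day, separator, month, same separator again (\2), year
DATE_RE = re.compile(r"(0?[1-9]|[12]\d|3[01])([/-])(0?[1-9]|1[0-2])\2(\d{4}|\d{2})")

def is_valid_date(text):
    """Format check only: hypothetical helper, not the notebook's solution."""
    return DATE_RE.fullmatch(text) is not None

samples = ["02/08/2018", "2/8/2018", "2/8/18", "23/08/2018", "23-08-2018"]
bad = ["23/13/2018", "32/01/2018", "23/08-2018"]

print([is_valid_date(s) for s in samples])
print([is_valid_date(s) for s in bad])
```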

Case study 2: validate credit card number

Assume that you’re given the job of implementing an order form for a
company that accepts payment by credit cards issued by the world’s major
credit card companies, such as Visa, MasterCard, and American Express.
• All Visa card numbers start with a 4. New cards have 16 digits. Old cards
have 13.

− 4123456789012
− 4123456789012345

• MasterCard numbers start with the numbers 51 through 55. All have
16 digits.

− 5123456789012345
− 5523456700012345

• American Express card numbers start with 34 or 37 and have 15 digits.
− 341234567890123
− 371234567890123

See the Jupyter notebook!
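
A sketch that turns the three rules above into patterns. The names CARD_PATTERNS and card_type are hypothetical, and the notebook's own solution may differ. Only the prefix and length are checked; a real form would also need a checksum test.

```python
import re

CARD_PATTERNS = {
    "Visa": re.compile(r"4\d{12}(\d{3})?"),          # starts with 4; 13 or 16 digits
    "MasterCard": re.compile(r"5[1-5]\d{14}"),       # starts with 51-55; 16 digits
    "American Express": re.compile(r"3[47]\d{13}"),  # starts with 34 or 37; 15 digits
}

def card_type(number):
    """Return the issuer whose pattern the number fits, or None."""
    for issuer, pattern in CARD_PATTERNS.items():
        if pattern.fullmatch(number):
            return issuer
    return None

for n in ["4123456789012", "5523456700012345", "371234567890123", "41234567890123"]:
    print(n, card_type(n))
```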

Summary

Summary: what to do this week

Regular expressions are the major tool used in data parsing.

Study the materials provided in Moodle.
Attend tutorial 2.
• Try to finish all the materials provided in the tutorial.

Assessment 1 will be released at the end of this week (week 2).
Topic for next week:
• Parsing data stored in different file formats: CSV, JSON, XML, Excel, and
PDF
