COMP3430 / COMP8430 Data wrangling
Interactive lecture week 9: Assignments, labs, and summary of week 9
(Lecturer: )

Lecture outline
● Administrative matters
– Assignments, Labs, and Exam
● Questions on Wattle
● Summary of week 9
● Q and A Session

Labs
● We had lab 6 this week, a continuation of the record linkage project
● Different evaluation methods were discussed
● We have now completed the record linkage process
● Sample solutions for the lab will be released on Monday
● For Assignment 3 you can now run the entire record linkage program

Labs next week
● We will give you another list of data sets for conducting experiments
● Focus on experimenting with different attribute combinations, parameter settings, and data set pairs. Try to understand how these aspects affect the final record linkage quality
● Next week's lab is the final lab session. You have a chance to ask questions about Assignment 3

Assignment 1
● All remark requests for Assignment 1 have been answered
● Your marks are now finalised and fixed
● We have identified two marking issues
– Different MAD calculations in R and Python. R uses a scaling factor of 1.4826 by default when calculating MAD
– Cramér’s V calculation for correlation between categorical variables
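The MAD difference above can be sketched as follows. This is a minimal illustration, assuming the Python side computed the unscaled median absolute deviation; the helper `mad` and the sample `values` are hypothetical and not taken from the assignment.

```python
import statistics


def mad(values, scale=1.0):
    """Median absolute deviation, multiplied by an optional scale factor."""
    med = statistics.median(values)
    deviations = [abs(v - med) for v in values]
    return scale * statistics.median(deviations)


values = [1, 2, 3, 4, 100]           # hypothetical sample data

raw_mad = mad(values)                # unscaled MAD
scaled_mad = mad(values, 1.4826)     # R's default scaling factor

print(raw_mad, scaled_mad)
```

Both results describe the same spread. The scaled value only makes the MAD comparable to a standard deviation for normally distributed data.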

Assignments 2 and 3
● Assignment 2: deadline is today, 8th October 2021, 11:55 PM
● Make sure you submit your assignments before the deadline
● Assignment 3: due on 22nd October 2021, 11:55 PM

Assignment 4
● Assignment 4 is for all students enrolled in COMP8430
● Assignment 4: due on 29th October 2021, 11:55 PM

Final examination
● The final exam will be on Monday 15th November, in the afternoon from 5.40 PM AEDT
● For details see the ANU exam timetable
● We recommend that you have Python installed, and VirtualBox or VMware Horizon set up, on your machine
● We will discuss the final exam in the last interactive lecture

Questions from Wattle forum
● Assume you need to deduplicate a database that contains 10,000 records. You apply a blocking technique and a total of 1,250,000 candidate record pairs are generated by your blocking technique. What is the reduction ratio of this blocking technique on this database?

Questions from Wattle forum
● Number of naive pairwise comparisons when deduplicating n records = n*(n-1)/2 = (10000*9999)/2 = 49995000
● RR = 1 – (num_pairs_after_blocking / total_num_pairs) = 1 – (1250000/49995000) = 0.9749975 ≈ 0.975
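The same calculation as a short Python sketch. The function name `reduction_ratio` is our own choice for illustration, and it assumes deduplication of a single database, where each record is compared once with every other record.

```python
def reduction_ratio(num_records, num_candidate_pairs):
    """Reduction ratio of a blocking technique when deduplicating one database."""
    # All unique record pairs: n * (n - 1) / 2
    total_pairs = num_records * (num_records - 1) // 2
    return 1 - num_candidate_pairs / total_pairs


rr = reduction_ratio(10_000, 1_250_000)
print(round(rr, 3))
```

For linking two different databases the total number of pairs would instead be the product of the two database sizes.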

Summary of week 9
● Data Fusion
– Difficulties with data fusion: missing values, contradicting attribute values, uncertainty, implementation into a database
– Different types of records need to be fused: identical records, subsumed records, complementing records, conflicting records
– Different conflict types, resolution strategies, resolution functions, and operations we can use in data fusion
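The four record types listed above can be sketched for a pair of records as follows. This is only an illustration, assuming the definitions commonly used in the data fusion literature (missing values are `None`); the function `classify_pair` and the example records are hypothetical.

```python
def classify_pair(rec_a, rec_b):
    """Classify two records that refer to the same entity (None = missing)."""
    a_only = b_only = False
    for key in rec_a.keys() | rec_b.keys():
        a, b = rec_a.get(key), rec_b.get(key)
        if a is not None and b is not None:
            if a != b:
                return "conflicting"     # both present, values differ
        elif a is not None:
            a_only = True                # only record A has a value
        elif b is not None:
            b_only = True                # only record B has a value
    if a_only and b_only:
        return "complementing"           # each fills gaps in the other
    if a_only or b_only:
        return "subsumed"                # one record contains the other
    return "identical"


rec1 = {"name": "Ann", "city": None, "phone": "123"}
rec2 = {"name": "Ann", "city": "Perth", "phone": None}
print(classify_pair(rec1, rec2))
```

Only conflicting records need a resolution function. The other three types can be fused by taking the non-missing values.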

Summary of week 9
● Advanced record linkage techniques
– Group-based record linkage
– Collective linkage techniques
– Active learning
– Geocode matching
– Linking temporal and dynamic data

Summary of week 9
● Privacy-preserving record linkage (PPRL) techniques
– Difference between privacy, confidentiality, and security
– Why privacy needs to be protected
– Need to preserve the privacy and confidentiality of entities represented by data during the data wrangling (DW) process
– PPRL techniques: Secure hash encoding, noise addition, generalisation, secure multi-party computation, perturbation
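As one illustration of the secure hash encoding technique listed above, the sketch below uses a keyed hash (HMAC with SHA-256) from the Python standard library. The function `encode_value`, the normalisation step, and the example key are assumptions made for this sketch, not the method presented in the lecture.

```python
import hashlib
import hmac


def encode_value(value, secret_key):
    """Encode an attribute value with a keyed one-way hash."""
    # Normalise first so trivial formatting differences do not change the code
    normalised = value.strip().lower().encode("utf-8")
    return hmac.new(secret_key, normalised, hashlib.sha256).hexdigest()


shared_key = b"key-agreed-by-database-owners"   # hypothetical shared secret

code_a = encode_value(" Smith ", shared_key)    # value held by database owner A
code_b = encode_value("smith", shared_key)      # value held by database owner B
print(code_a == code_b)
```

Such an encoding supports exact matching only: a single character difference, such as "smith" versus "smyth", produces a completely different hash code.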

Q and A Session
● Socrative
– https://b.socrative.com/login/student/
– Room Name: COMP3430
