ECE 219 Large-Scale Data Mining: Models and Algorithms
Project 1: End-to-End Pipeline to Classify News Articles
Statistical classification broadly refers to the task of learning to identify a subset of categories that pertain to a data point (a sample of text, an image, a video clip, a time-signal, etc.) from a predefined (generally human-guided) larger set of categories. The model attempts to master the task given a training data set (kept separate from an evaluation set) in which each data point is pre-labeled with its “correct” category membership(s). In this project, we deal with the classification of text data. The project consists of building an end-to-end pipeline to classify samples of news articles and involves the following ML components:
1. Feature Extraction: Construction of TF-IDF representations of textual data;
2. Dimensionality Reduction: Principal Component Analysis (PCA) and Non-negative Matrix Factorization (NMF) – generally necessary for classical ML methods;
3. Application of Simple Classification Models: Applying common classification methods to the extracted features, such as Logistic/Linear Classification and Support Vector Machines;
4. Evaluation of the Pipeline: Evaluating and diagnosing classification results using Grid Search and Cross-Validation;
5. Replacing corpus-level features with pretrained features: Applying pre-training to a downstream classification task and evaluating comparative performance.
Getting familiar with the dataset
Please access the dataset at this link. We are using a custom dataset that was designed specifically for this quarter.1 Note: Do not attempt to open the downloaded file in Excel – the formatting of the file when visualized in Excel might suggest the data is corrupted, but this is not true. Consider exploring the dataset using Pandas. You might find the following Pandas functions helpful (in no specific order): read_csv, head, hist, shape.

1 This dataset was extracted from the search feature in The GDELT Project and a recursive crawler that traverses resulting news links.
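For example, a minimal exploration sketch. The filename Project1-dataset.csv and the column name leaf_label are assumptions (substitute whatever your download and dataframe actually use); full_text and root_label follow the splitting snippet later in this handout:

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("Project1-dataset.csv")   # hypothetical filename
print(df.shape)    # (number of samples, number of features)
print(df.head())   # peek at the first few rows

# (a) histogram of alpha-numeric character counts per row of full_text
df["full_text"].str.count(r"[A-Za-z0-9]").hist(bins=50)
plt.xlabel("alpha-numeric characters per article"); plt.ylabel("frequency")
plt.show()

# (b), (c) class histograms for the two label columns
df["leaf_label"].value_counts().plot(kind="bar")   # leaf_label name assumed
plt.show()
df["root_label"].value_counts().plot(kind="bar")
plt.show()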
QUESTION 1: Provide answers to the following questions:
• Overview: How many rows (samples) and columns (features) are present in the dataset? We got about 2000 samples.
• Histograms: Plot 3 histograms: (a) the total number of alpha-numeric characters per data point (row) in the feature full text, i.e., count on the x-axis and frequency on the y-axis; (b) the column leaf label – class on the x-axis; (c) the column root label – class on the x-axis.
• Interpret Plots: Provide qualitative interpretations of the histograms.
The two sets of labels leaf label and root label are hierarchically arranged as follows:
root label    leaf label
sports        chess, cricket, soccer, football
climate       %22forest%20fire%22, flood, earthquake, drought
Binary Classification
For the first part of the project, we will be using only the full text column as the raw features per sample (row) and the root label column as the label for each sample. The root labels are well-separated. Before continuing on, please set the random seed as follows to ensure consistency:
import numpy as np
import random
np.random.seed(42)
random.seed(42)
1 Splitting the entire dataset into training and testing data
In order to measure the performance of our binary classification model, we split the dataset into a training set and a testing set. The model is trained on the training set and evaluated on the testing set. Note: Do not train on the testing set. We create the sets from a Pandas dataframe df that contains the entire dataset. Please make sure that the random seeds are set and that the fraction of the test set is 0.2:

from sklearn.model_selection import train_test_split
train, test = train_test_split(df[["full_text", "root_label"]], test_size=0.2)

train and test are the dataframes containing the specific rows pertaining to the training data and testing data, respectively.
QUESTION 2: Report the number of training and testing samples.
Hint: You should have about 1650 samples in the training set and about 400 samples in the testing set.
2 Feature Extraction
The primary step in classifying a corpus of text is choosing a good representation of each data point. Since the full text column contains the raw features that describe each data point, we seek a feature extraction module that encodes the raw text into processed, computationally compatible features.

A good representation should retain enough class-discriminating information post-processing to perform classification competitively, yet at the same time be concise enough to avoid computational intractability and overfitting.
A first model: The Bag-of-Words (BOW): One feature extraction technique is the “Bag of Words” model, where a document – in this case the full text feature of one data point – is represented as a histogram of word frequencies, or other statistics, within a fixed vocabulary of words. The vocabulary is aggregated from the training set (process described below).
Compiling the Vocabulary: (a) Each raw feature (text segment) is split into sentences; (b) each sentence is split into words; (c) the list of words per sentence is passed jointly into a part-of-speech tagger to identify the nouns, verbs, adjectives, etc.; (d) these tags are used to lemmatize/stem 2 each word; (e) words are filtered: very rare or very frequent words (in terms of the number of documents they occur in, or the number of times they occur within a document) are removed, digit- and punctuation-dominant words are removed, and stopwords (words contained in a database of common words) are removed; (f) the remaining words are added to the vocabulary. Say that the selected words composing the vocabulary form a set W. For a dataset D, the processed features can be collectively represented as a data matrix X ∈ R^(|D|×|W|), so each row captures the count of each word (the histogram) per data point.
An Example: If a test data point’s raw feature contains the text,
“On Saturday, the NHL hockey team went to the school to volunteer and educate.
Outreach is required for hockey, which is dying in popularity in recent years.”
and W = [“hockey”, “volunteer”, “sport”], the row in X corresponding to this data point would be [2, 1, 0] because “hockey” appears twice, “volunteer” appears once, and “sport” does not appear at all (though it might appear in another data point; remember, the vocabulary is aggregated across the training set).
During Testing: Each text sample in the testing set is processed in the same way as the training samples; however, words are no longer added to W – the vocabulary was fixed during training. Instead, if a word that exists in the vocabulary occurs in the processed words from a testing sample, its count is incremented in the resulting feature matrix X.
To avoid adding to the vocabulary during testing, and to avoid training on the test set in general, please note as a rule: for most of this project you will be using the NLTK and scikit-learn (sklearn) libraries for text processing and classifier design. In sklearn, in particular, only use the functions fit_transform and fit on the training set, and only use transform on the testing set.
The better model: The Term Frequency-Inverse Document Frequency (TF-IDF) Model: (“document” and “data point” are used interchangeably.) While the Bag-of-Words model continues to be used in many feature extraction pipelines, a normalized count vector that not only counts the number of times a word occurs in a document (the term frequency) but also scales the resulting count by the number of documents the word appears in (the document frequency) might provide a more valuable feature extraction approach.
The focus on the document frequency (and, more correctly, its inverse) encourages the feature vector to discriminate between documents as well as represent a document. TF-IDF does not only ask “What is the frequency of different words in a document?” but rather “What is the frequency of words in a document that is specific to that document and differentiates it from other documents?” A human reading a particular news article in the sports section will usually ignore contextually dominant words such as “sport”, “competition”, and “player”, despite these words being frequent in every news article in the sports section.

2 Lemmatization: Lemmatization uses a pre-learned vocabulary and the morphological information of words in sentence context, normally aiming to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the lemma.
Stemming: Stemming is a crude heuristic that removes the rightmost characters of words in the hope of mapping multiple word derivatives to the same root word.
If confronted with the word “saw”, stemming might return just “s”, whereas lemmatization would attempt to return either “see” or “saw” depending on whether the word was used as a verb or a noun.
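As a quick illustration of the distinction drawn in the footnote above, here is a minimal NLTK sketch (assuming the WordNet data has been downloaded; the chosen words are arbitrary):

import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

# nltk.download('wordnet')  # needed once for the lemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

for word in ["saw", "studies", "running"]:
    print(word,
          stemmer.stem(word),                   # crude suffix stripping
          lemmatizer.lemmatize(word, pos="v"),  # lemma when treated as a verb
          lemmatizer.lemmatize(word, pos="n"))  # lemma when treated as a noun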
Such context-based conditioning of information is widely observed. The human perception system usually applies a saturating function (such as a logarithm or square-root) to the actual inputs of the visual system before passing them on to the neuronal network in the brain. This makes sure that a contextually dominant signal does not overwhelm the decision-making processes in the brain. The TF-IDF functions draw their inspiration from such neuronal systems. Here we define the TF-IDF score to be:
TF-IDF(d, t) = TF(t, d) × IDF(t)
where TF(t, d) represents the frequency of the (processed, lemmatized, otherwise filtered) word t in document d, and the inverse document frequency is defined as:

IDF(t) = log( n / DF(t) ) + 1

where n is the total number of documents, and DF(t) is the document frequency, i.e. the number of documents that contain the word t.
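As a purely illustrative example with made-up numbers: if a word t occurs 3 times in document d, the corpus has n = 1000 documents, and t appears in DF(t) = 100 of them, then TF-IDF(d, t) = 3 × (log(1000/100) + 1) = 3 × (log 10 + 1).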
import re

def clean(text):
    # remove URLs that start a line
    text = re.sub(r'^https?:\/\/.*[\r\n]*', '', text, flags=re.MULTILINE)
    # replace HTML line breaks and common HTML entities
    texter = re.sub(r"<br />", " ", text)
    texter = re.sub(r"&quot;", "\"", texter)
    texter = re.sub('&#39;', "\"", texter)
    texter = re.sub('\n', " ", texter)
    texter = re.sub(' u ', " you ", texter)
    texter = re.sub('`', "", texter)
    # collapse repeated spaces and repeated punctuation
    texter = re.sub(' +', ' ', texter)
    texter = re.sub(r"(!)\1+", r"!", texter)
    texter = re.sub(r"(\?)\1+", r"?", texter)
    texter = re.sub('&amp;', 'and', texter)
    texter = re.sub('\r', ' ', texter)
    # strip remaining HTML tags and non-ASCII characters
    clean = re.compile('<.*?>')
    texter = texter.encode('ascii', 'ignore').decode('ascii')
    texter = re.sub(clean, '', texter)
    if texter == "":
        texter = ""
    return texter
QUESTION 3: Use the following specs to extract features from the textual data:
• Before doing anything, please clean each data sample using the code block provided above. This function helps remove many but not all HTML artefacts from the crawler’s output. You can also build your own cleaning module if you find this function to be ineffective.
• Use the “english” stopwords of the CountVectorizer
• Exclude terms that are numbers (e.g. “123”, “-45”, “6.7” etc.)
• Perform lemmatization with nltk.wordnet.WordNetLemmatizer and pos_tag
• Use min_df=3
Please answer the following questions:
• What are the pros and cons of lemmatization versus stemming? How do these processes affect the dictionary size?
• min_df means minimum document frequency. How does varying min_df change the TF-IDF matrix?
• Should I remove stopwords before or after lemmatizing? Should I remove punctuation before or after lemmatizing? Should I remove numbers before or after lemmatizing?
• Report the shape of the TF-IDF-processed train and test matrices. The number of rows should match the results of Question 2. The number of columns should roughly be on the order of k × 10³. This dimension will vary depending on your exact method of cleaning and lemmatizing, and that is okay.
The following functions in sklearn will be useful: CountVectorizer, TfidfTransformer, and, for the daring, Pipeline. The documentation on lemmatization may also help. Please refer to the discussion section notebooks for more guidance.
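One possible end-to-end sketch of this extraction step, assuming train and test are the dataframes from the split above and clean is the function given earlier; the penn_to_wordnet helper, the callable analyzer, and the number-filtering regex are illustrative choices rather than the required implementation:

import re
import nltk
from nltk import pos_tag, word_tokenize
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import (CountVectorizer, TfidfTransformer,
                                              ENGLISH_STOP_WORDS)

# nltk.download('punkt'); nltk.download('averaged_perceptron_tagger'); nltk.download('wordnet')

lemmatizer = WordNetLemmatizer()

def penn_to_wordnet(tag):
    # map Penn Treebank POS tags to the WordNet POS classes the lemmatizer expects
    if tag.startswith('J'):
        return wordnet.ADJ
    if tag.startswith('V'):
        return wordnet.VERB
    if tag.startswith('R'):
        return wordnet.ADV
    return wordnet.NOUN

def lemmatize_doc(doc):
    # clean, tokenize, POS-tag, lemmatize; drop stop words and number-like tokens
    tagged = pos_tag(word_tokenize(clean(doc)))
    lemmas = [lemmatizer.lemmatize(w.lower(), penn_to_wordnet(t)) for w, t in tagged]
    return [w for w in lemmas
            if w not in ENGLISH_STOP_WORDS and not re.fullmatch(r'[\d.,\-]+', w)]

# fit the vocabulary and IDF weights on the TRAINING set only
count_vect = CountVectorizer(min_df=3, analyzer=lemmatize_doc)
tfidf = TfidfTransformer()
X_train_tfidf = tfidf.fit_transform(count_vect.fit_transform(train["full_text"]))

# only transform (never fit) the TESTING set
X_test_tfidf = tfidf.transform(count_vect.transform(test["full_text"]))
print(X_train_tfidf.shape, X_test_tfidf.shape)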
3 Dimensionality Reduction
After applying the above operations, the dimensionality of the representation vectors (TF-IDF vectors) is large. Classical learning algorithms, like the ones required in this section, however, may perform poorly with such high-dimensional data. Since the TF-IDF matrix is sparse and low-rank, as a remedy, one can project the points from the larger dimensional space to a lower dimension.
In this project, we use two dimensionality reduction methods: Latent Semantic Indexing (LSI) and Non-negative Matrix Factorization (NMF), both of which minimize the Mean Squared residual Error (MSE) between the original TF-IDF data matrix and a reconstruction of the matrix from its low-dimensional approximation. Recall that our data is the term-document TF-IDF matrix, whose rows correspond to TF-IDF representation of the documents, i.e.
X = [ tfidf(d1, t1)   · · ·   tfidf(d1, tm) ]
    [ tfidf(d2, t1)   · · ·   tfidf(d2, tm) ]
    [      ...                     ...      ]
    [ tfidf(dn, t1)   · · ·   tfidf(dn, tm) ]
Latent Semantic Indexing (LSI): The LSI representation is obtained by computing left and right singular vectors corresponding to the top k largest singular values of the term-document TF-IDF matrix X.
We perform Singular Value Decomposition (SVD) on the matrix X, resulting in X = UΣV^T, where U and V are orthogonal. Let the singular values in Σ be sorted in descending order; then the first k columns of U and V are called Uk and Vk respectively. Vk consists of the principal components of the matrix X in the feature space.
Then we use XVk (which is also equal to UkΣk) as the dimension-reduced data matrix, where rows still correspond to documents, only now each data point is represented in a (far) lower-dimensional space. In this way, the number of features is reduced. LSI is similar to Principal Component Analysis (PCA); see the lecture notes for their relationship.
Having obtained U and V, to reduce the test data we ONLY multiply the test TF-IDF matrix Xt by Vk, i.e. Xt,reduced = XtVk. By doing so, we project the test TF-IDF vectors onto the principal components learned from training, and use the projections as the dimensionality-reduced data.
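A minimal sketch of this train/test treatment using sklearn's TruncatedSVD, assuming X_train_tfidf and X_test_tfidf are the TF-IDF matrices produced earlier (k = 50 is just an illustrative choice):

from sklearn.decomposition import TruncatedSVD

# k = 50 components, fit on the training TF-IDF matrix only
svd = TruncatedSVD(n_components=50, random_state=42)
X_train_lsi = svd.fit_transform(X_train_tfidf)   # equals X Vk = Uk Sigma_k
X_test_lsi = svd.transform(X_test_tfidf)         # projects test data onto the same Vk
print(X_train_lsi.shape, svd.explained_variance_ratio_.sum())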
Non-negative Matrix Factorization (NMF): NMF tries to approximate the data matrix X ∈ Rn×m (n = |D| docs and m = |W| terms) with WH (W ∈ Rn×r, H ∈ Rr×m). Concretely,
it finds the non-negative matrices W and H s.t. ∥X − WH∥²F is minimized, where ∥A∥F ≡ ( Σi,j A²ij )^(1/2) is the Frobenius norm.
Then we use W as the dim-reduced data matrix, and in the fit step, we calculate both W and H. The intuition behind this is that we are trying to describe the documents (the rows in X) as a (non-negative) linear combination of r topics:
      [ x1^T ]           [ w11  w12  · · ·  w1r ]   [ h1^T ]
X  =  [  ... ]  ≈  WH  = [          ...          ] × [  ... ]
      [ xn^T ]           [ wn1  wn2  · · ·  wnr ]   [ hr^T ]

Here we see h1^T, . . . , hr^T as r “topics”, each of which consists of m scores indicating how important each term is in the topic. Then xi^T ≈ wi1 h1^T + wi2 h2^T + · · · + wir hr^T, for i = 1, . . . , n.
Now how do we calculate the dim-reduced test data matrix? Again, we try to describe the document vectors (rows, by our convention here) in the test data (call it Xt) with (non-negative) linear combinations of the “topics” we learned in the fit step. The “topics”, again, are the rows of the H matrix, h1^T, . . . , hr^T. How do we do that? Just solve the optimization problem
min_{Wt ≥ 0}  ∥Xt − Wt H∥²F
where H is fixed as the H matrix we learned in the fit step. Then Wt is used as the dim-reduced version of Xt.
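A corresponding minimal sketch with sklearn's NMF, again assuming the TF-IDF matrices from earlier (the init, max_iter, and r = 50 values are illustrative):

import numpy as np
from sklearn.decomposition import NMF

# r = 50 "topics"
nmf = NMF(n_components=50, init='random', random_state=42, max_iter=400)
W_train = nmf.fit_transform(X_train_tfidf)   # fit step: learns both W and H (H = nmf.components_)
W_test = nmf.transform(X_test_tfidf)         # solves min_{Wt >= 0} ||Xt - Wt H||_F^2 with H fixed

# reconstruction residual ||X - WH||_F^2 on the training data
train_residual = np.sum((X_train_tfidf.toarray() - W_train @ nmf.components_) ** 2)
print(W_train.shape, train_residual)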
QUESTION 4: Reduce the dimensionality of the data using the methods above:
• Plot the explained variance ratio across multiple different k = [1, 10, 50, 100, 200, 500, 1000, 2000] for LSI and for the next few sections choose k = 50. What does the explained variance ratio plot look like? What does the plot’s concavity suggest?
• With k = 50 found in the previous sections, calculate the reconstruction residual MSE error when using LSI and NMF – they both should use the same k = 50. Which one is larger, the ∥X − WH∥²F in NMF or the ∥X − Uk Σk Vk^T∥²F in LSI, and why?
4 Classification Algorithms

In this part, you are asked to use the dimensionality-reduced training data from LSI with your choice of k to train (different types of) classifiers, and evaluate the trained classifiers with test data. Your task is to classify the documents into two classes (for now a binary classification task): sports versus climate.
Classification Measures: Classification quality can be evaluated using different measures such as precision, recall, F-score, etc. Refer to the discussion material to find their definition.
Depending on the application, the true positive rate (TPR) and the false positive rate (FPR) have different levels of significance. In order to characterize the trade-off between the two quantities, we plot the receiver operating characteristic (ROC) curve. For binary classification, the curve is created by plotting the true positive rate against the false positive rate at various threshold settings on the probabilities assigned to each class (let us assume probability p for class 0 and 1 − p for class 1). In particular, a threshold t is applied to the value of p to select between the two classes. The value of the threshold t is swept from 0 to 1, and a pair of TPR and FPR values is obtained for each value of t. The ROC is the curve of TPR plotted against FPR.
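A minimal sketch of how such a curve can be produced with sklearn, assuming y_test holds binary (0/1) test labels and scores holds the classifier's continuous outputs (e.g. decision-function values); both names are placeholders:

import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc

# y_test: binary ground-truth test labels; scores: continuous classifier outputs
fpr, tpr, thresholds = roc_curve(y_test, scores)
plt.plot(fpr, tpr, label="AUC = %.3f" % auc(fpr, tpr))
plt.plot([0, 1], [0, 1], linestyle="--")   # chance diagonal
plt.xlabel("False Positive Rate (FPR)")
plt.ylabel("True Positive Rate (TPR)")
plt.legend()
plt.show()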
Support Vector Machines (SVM): Linear Support Vector Machines are efficient when dealing with sparse, high-dimensional datasets, including textual data. They have been shown to have good generalization and test accuracy, while having low computational complexity.
These models learn a vector of feature weights, w, and an intercept, b, given the training dataset. Once the weights are learned, the label of a data point is determined by thresholding w^T x + b with 0, i.e. sign(w^T x + b). Alternatively, one can produce probabilities that the data point belongs to either class by applying a logistic function instead of hard thresholding, i.e. calculating σ(w^T x + b).
The learning process of the parameters w and b involves solving the following optimization problem:

min_{w, b, ξ}   ∥w∥² + γ Σ_{i=1}^{n} ξi

s.t.   yi(w^T xi + b) ≥ 1 − ξi,   ξi ≥ 0,   ∀i ∈ {1, . . . , n}
where xi is the ith data point, and yi ∈ {0, 1} is its class label.
Minimizing the sum of the slack variables corresponds to minimizing the loss function on the training data. On the other hand, minimizing the first term, which is basically a regularization term, corresponds to maximizing the margin between the two classes. Note that in the objective function, each slack variable represents the amount of error that the classifier can tolerate for a given data sample. The trade-off parameter γ controls the relative importance of the two components of the objective function. For instance, when γ ≫ 1, misclassification of individual points is highly penalized. This is called “Hard Margin SVM”. In contrast, a “Soft Margin SVM”, which is the case when γ ≪ 1, is very lenient towards misclassification of a few individual points as long as most data points are well separated.
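As a minimal sketch of how this trade-off can be explored in sklearn, assuming X_train_lsi, X_test_lsi, and y_train are the LSI-reduced features and labels from earlier (LinearSVC is one reasonable choice; γ in the formulation above plays the role of sklearn's C penalty parameter):

from sklearn.svm import LinearSVC

hard_svm = LinearSVC(C=1000).fit(X_train_lsi, y_train)      # heavy slack penalty ("hard margin")
soft_svm = LinearSVC(C=0.0001).fit(X_train_lsi, y_train)    # lenient slack penalty ("soft margin")

y_pred = hard_svm.predict(X_test_lsi)                 # sign(w^T x + b)
scores = hard_svm.decision_function(X_test_lsi)       # w^T x + b, useful for the ROC curve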
QUESTION 5: Compare and contrast hard-margin and soft-margin linear SVMs:
• Train two linear SVMs:
– Train one SVM with γ = 1000 (hard margin), another with γ = 0.0001 (soft margin).
– Plot the ROC curve, report the confusion matrix and calculate the accuracy, recall, precision and F-1 score of both SVM classifiers on the testing set. Which one performs better? What about for γ = 100000?
– What happens for the soft margin SVM? Why is this the case? Analyze in terms of the confusion matrix.
∗ Does the ROC curve of the soft margin SVM look competitive? Please explain.
• Use cross-validation to choose γ (use average validation accuracy to compare): using 5-fold cross-validation, find the best value of the parameter γ in the range {10^k | −3 ≤ k ≤ 6, k ∈ Z}. Again, plot the ROC curve, report the confusion matrix, and calculate the accuracy, recall, precision, and F-1 score of this best SVM. (A minimal cross-validation sketch follows below.)
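One possible way to run the γ sweep, again assuming X_train_lsi and y_train from the earlier sketches (variable names are illustrative), with γ passed as sklearn's C parameter:

from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC

# sweep gamma (sklearn's C) over {10^k : -3 <= k <= 6} with 5-fold cross-validation,
# comparing average validation accuracy
param_grid = {"C": [10.0 ** k for k in range(-3, 7)]}
grid = GridSearchCV(LinearSVC(), param_grid, cv=5, scoring="accuracy")
grid.fit(X_train_lsi, y_train)
print(grid.best_params_, grid.best_score_)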
Logistic Regression: Logistic regres