Now that you have completed some of the simpler exercises in the text mining unit, you are ready for a real challenge. It will require you to understand something about conducting an error analysis, so you will need to read all the way through the LightSIDE user’s manual to prepare. See the file called example.doc for a walk-through on a different dataset.

Dataset Description:

Most likely, at some point when you have searched for information online, you have found something of interest on the well-known Wikipedia site at http://www.wikipedia.org/. Whereas in the early days of Wikipedia almost anyone was free to make contributions, access to certain functions is now restricted to people who are accepted as administrators. The goal of your machine learning work for this test of practical competence is to train a classifier that predicts whether an applicant will be accepted as an administrator, based on what they say about themselves on their application as well as other information about them, all of which is found in the wikipedia.csv data set (provided for you in CSV form).

SuccessfulRFA is the class attribute you are trying to predict: FALSE indicates that the applicant did not get the position, whereas TRUE indicates that they did. The text column contains what the applicant said about themselves. RFAYearMonth is when the applicant applied. The remaining attributes are counts of different types of editing behavior on Wikipedia, i.e., edits to different namespaces (sections) of Wikipedia: ArtEdits is the number of edits made to articles, ArtTalkEdits to article “talk” pages, UserEdits to user pages, UserTalkEdits to user “talk” pages, WPEdits to Wikipedia namespace pages (mainly policy pages), and WPTalkEdits to Wikipedia namespace talk pages. TotalEdits is the total number of edits across all namespaces (including several other smaller ones).
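Before doing anything in LightSIDE, it can help to look at the data programmatically. The sketch below uses pandas; since the real wikipedia.csv is not reproduced here, it builds a tiny two-row stand-in with the column names described above (all values are invented, purely for illustration):

```python
import pandas as pd

# Illustrative stand-in for wikipedia.csv: two invented rows using the
# column names described above. With the real file you would instead run
# df = pd.read_csv("wikipedia.csv").
df = pd.DataFrame({
    "SuccessfulRFA": [True, False],
    "text": ["I have written several featured articles...",
             "I am fairly new here, but eager to help..."],
    "RFAYearMonth": ["2007-03", "2008-11"],
    "ArtEdits": [12000, 800],
    "ArtTalkEdits": [3000, 150],
    "UserEdits": [500, 90],
    "UserTalkEdits": [2500, 300],
    "WPEdits": [1800, 40],
    "WPTalkEdits": [600, 10],
    "TotalEdits": [21000, 1450],
})

# Class balance, and summary statistics for the edit-count meta-data.
print(df["SuccessfulRFA"].value_counts())
print(df.filter(like="Edits").describe())
```

Looking at `value_counts()` on the real data tells you how skewed the classes are, which matters when you interpret accuracy later.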

You can find more information about Wikipedia administrators and the process of applying for an administrator position at http://en.wikipedia.org/wiki/Wikipedia:Administrators and http://en.wikipedia.org/wiki/Wikipedia:Requests_for_adminship.

Your Goal:

In this dataset the meta-data features, which describe each applicant’s editing behavior, are more strongly predictive of the class value than the text features, and adding text features to the meta-data features may not significantly improve performance. So rather than working primarily toward improving on what is possible with the meta-data features, your job is to understand the unique contribution of each type of feature and why it is difficult to improve performance by adding text features to the meta-data features on this data set.

Step-by-Step Guide:

  1. Manually examine some examples from the wikipedia.csv data and note what features seem likely to predict whether an applicant gets the administrator role or not. Split the data into two sets: a small development set that you will use for qualitative analysis and to inspire feature-engineering ideas, and a cross-validation set that you will use for building and evaluating models.
  2. Create a feature space from the cross-validation set with only the column features and the class attribute (SuccessfulRFA); you can do this easily using the column feature extraction plugin in LightSIDE. Save it as baseline-meta.arff. Then see what the best performance is that you can get in model-building experiments with this feature space. Make a note of that performance.
  3. Now create a feature space from the cross-validation set with only the text features and the class attribute (SuccessfulRFA), but none of the column features. Save this feature space as baseline-text.arff. Then, in LightSIDE, run a cross-validation experiment in which you use feature selection to select the top 200 features on each fold and use SMO to build the model. Make a note of the performance.
  4. Do an error analysis on the two baseline feature spaces and determine where the machine learning algorithm is making mistakes in each case. Do the two models tend to make the same mistakes, or are they confused by different examples? What problems can you identify with each of these feature spaces using the Explore Results interface?
  5. Now create a feature space that combines both the text and the column features. Based on your error analyses, add any additional features you think will help address the problems you identified. You may use any classifier, and it is up to you whether to use feature selection and, if so, how many features to select.
  6. Compare the performance of your three models in LightSIDE and determine whether any observed differences in performance are statistically significant.
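LightSIDE drives these experiments through its GUI, but the same three-way comparison can be sketched in code. The sketch below is an assumption on my part, not part of the assignment: it uses scikit-learn, with `LinearSVC` as a rough stand-in for Weka’s SMO, `SelectKBest(chi2)` as the per-fold feature selection (k=5 here rather than 200, because the toy vocabulary is tiny), and a paired t-test over per-fold accuracies as one simple significance check. All data below is invented so the sketch runs on its own.

```python
import numpy as np
from scipy.stats import ttest_rel
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)

# Invented mini-corpus standing in for the 'text' column, plus three
# noisy numeric columns standing in for the edit-count meta-data.
texts = ["experienced editor many article edits"] * 30 + \
        ["new account few contributions"] * 30
y = np.array([1] * 30 + [0] * 30)
meta = np.vstack([rng.normal(loc, 1.0, (30, 3)) for loc in (2.0, 1.0)])

# Baseline 1: meta-data (column) features only.
meta_scores = cross_val_score(LinearSVC(), meta, y, cv=5)

# Baseline 2: text features only, with top-k feature selection applied
# inside each fold (mirroring the per-fold selection in step 3).
text_clf = make_pipeline(CountVectorizer(),
                         SelectKBest(chi2, k=5),
                         LinearSVC())
text_scores = cross_val_score(text_clf, texts, y, cv=5)

# Combined space: concatenate text counts with the meta-data columns.
X_text = CountVectorizer().fit_transform(texts).toarray()
X_combined = np.hstack([X_text, meta])
combined_scores = cross_val_score(LinearSVC(), X_combined, y, cv=5)

# Paired t-test over per-fold accuracies: does adding text features
# change performance significantly relative to meta-data alone?
stat, p = ttest_rel(meta_scores, combined_scores)
print("meta:", meta_scores.mean(),
      "text:", text_scores.mean(),
      "combined:", combined_scores.mean(),
      "p:", p)
```

On the real data you would replace the invented arrays with the wikipedia.csv columns; the structure of the comparison (three feature spaces, identical folds, a paired significance test) is the part that carries over.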

Complete This Assignment

After you do this assignment, please share it so it can appear with other responses below. If you are writing to a blog connected to this site just use a tag or category Assignment93 when writing a post on your own blog. Then your response will be added to the list below.

Or if your response exists at a public viewable URL, you can add the information directly to this site (it will appear pending moderator approval).


Tutorials for this Assignment

Have you created something or know of an external resource that might help others complete this assignment? If you are writing to a blog connected to this site just use a tag or category Tutorial93 when writing a post on your own blog. Then your tutorial will be added to the list below.

Or if the tutorial is available at a public URL please share it (it will appear below pending moderator approval).

