|dc.description.abstract||The rapidly growing World Wide Web (WWW) is no longer a passive information provider: Internet users themselves have become contributors to it. The large volume of user-generated data, along with non-user-generated data, makes our society well informed but perhaps over-informed. The increasing amount of unorganized, unstructured, or even randomly generated data has driven the momentum of big data analysis, which aims to discover and learn the hidden patterns behind the data. In this thesis, we look at two problems of mining knowledge from data.
In the first project, we classify Twitter users as "democrats" or "republicans". We first propose a sentiment-based classification model, with the aim of addressing the problem that conventional quantitative features, such as tweet count, follower-to-following ratio, and election tweet count, cannot reflect the opinion alignment of tweeters. We therefore use sentiment scores over multiple topics as the feature vector in the classification model. We also propose an automatic topic selection model to learn the distinguishing topics, which makes the sentiment feature selection domain independent. However, the sentiment-based classification model does not perform much better than the non-sentiment model. We therefore propose using social relationship graph information to adjust the sentiment vectors. The graph-adjusted sentiment model achieves a classification accuracy higher than 80 percent. In addition, we deploy a completely graph-based model, a Belief Propagation (BP) model on the social graph, which achieves a prediction accuracy higher than 85 percent. We conclude that the social relationship graph is more important than the sentiment of tweets for classifying users as "democrats" or "republicans".
In the second project, we propose a new, algorithmic way to rank graduate schools, as an alternative to the qualitative surveys that U.S. News uses. Based on the assumption that "schools tend to hire PhD graduates from better or peer schools" as their faculty members, we propose deploying link-based ranking algorithms on the "hiring graph" among universities. We refine the PageRank (PR) algorithm and the Hyperlink-Induced Topic Search (HITS) algorithm by taking edge weights into consideration, as our own way to rank graduate programs. To validate our approach, we collect two separate data sets to construct the "hiring graph": faculty data from the top 50 Computer Science (CS) programs and faculty data from the top 50 Mechanical Engineering (ME) programs across the United States. By comparing our new rankings with the U.S. News ranking, we discover that some programs are either under-ranked or over-ranked by U.S. News. We also conduct extensive analysis of our data, revealing interesting patterns and cases behind the U.S. News ranking. Finally, we conduct sensitivity analysis on each proposed algorithm to see how sensitive it is to changes in the "hiring graph".||