dc.contributor.advisor: Furuta, Richard K.
dc.contributor.advisor: Shipman, Frank M.
dc.creator: Bogen, Paul
dc.date.accessioned: 2012-02-14T22:19:37Z
dc.date.accessioned: 2012-02-16T16:15:25Z
dc.date.available: 2014-01-15T07:05:34Z
dc.date.created: 2011-12
dc.date.issued: 2012-02-14
dc.date.submitted: December 2011
dc.identifier.uri: http://hdl.handle.net/1969.1/ETD-TAMU-2011-12-10338
dc.description.abstract: Digital collections are ubiquitous. However, not all digital collections are the same. While most digital collections have limited forms of change - primarily creation and deletion of additional resources - there exists a class of digital collections that undergo additional kinds of change. These collections are made up of resources that are distributed across the Internet and brought together into the collection via hyperlinking. This means the underlying collection members are not controlled by the curator of the collection, and resources can be expected to change as time goes on. To further complicate matters, these collections can be hard to maintain when they are large, highly dynamic, or lacking active curation. Part of the difficulty in maintaining these collections is determining whether a changed page is still a valid member of the collection. While others have tried to address this problem by measuring change and defining a maximum allowed threshold of change, these methods treat all change as a potential problem and treat web content as a static document despite its intrinsically dynamic nature. Instead, I approach the problem of determining the significance of change on the web by embracing change as a normal part of a web document's lifecycle. Instead of using thresholds to identify abnormal changes, I determine the difference between what a maintainer expects a page to do and what it actually does. These models are created using a variety of feature extractors to find pertinent information in a page, a Kalman filter to model the history of a page and predict its next version, and finally classification of the results as either expected or unexpected change. I evaluate the different options for extractors and analyzers to determine the best options from my suite of possibilities.
This work is informed by a series of studies on both web pages and potential collection maintainers, observations of the NSDL Pathways, and a ground-truth set of blog changes tagged with human judgments of the kind of change. The results of this work showed a statistically significant improvement over a range of traditional threshold techniques when applied to the collection of tagged blog changes. [en]
dc.format.mimetype: application/pdf
dc.language.iso: en_US
dc.subject: Collection Management [en]
dc.subject: Digital Libraries [en]
dc.subject: Distributed Collection Manager [en]
dc.subject: Blogs [en]
dc.subject: Change on the Web [en]
dc.title: Intelligent Information Interaction for Managing Distributed Collections of Web Documents [en]
dc.type: Thesis [en]
thesis.degree.department: Computer Science and Engineering [en]
thesis.degree.discipline: Computer Science [en]
thesis.degree.grantor: Texas A&M University [en]
thesis.degree.name: Doctor of Philosophy [en]
thesis.degree.level: Doctoral [en]
dc.contributor.committeeMember: Leggett, John J.
dc.contributor.committeeMember: Burkart, Patrick
dc.type.genre: thesis [en]
dc.type.material: text [en]
local.embargo.terms: 2014-01-15

