Show simple item record

dc.contributor.advisorShipman , Frank
dc.creatorPoursardar, Faryaneh
dc.date.accessioned2019-01-23T17:09:59Z
dc.date.available2019-01-23T17:09:59Z
dc.date.created2018-12
dc.date.issued2018-12-10
dc.date.submittedDecember 2018
dc.identifier.urihttps://hdl.handle.net/1969.1/174344
dc.description.abstractSystems for retrieving or archiving Internet resources often assume a URL acts as a delimiter for the resource. But there are many situations where Internet resources do not have a one-to-one mapping with URLs. For URLs that point to the first page of a document that has been broken up over multiple pages, users are likely to consider the whole article as the resource, even though it is spread across multiple URLs. Comments, tags, ratings, and advertising might or might not be perceived as part of the resource whether they are retrieved as part of the primary URL or accessed via a link. Understanding what people perceive as part of a resource is necessary prior to developing algorithms to detect and make use of resource boundaries. A pilot study examined how content similarity, URL similarity, and the combination of the two matched human expectations. This pilot study showed that more nuanced techniques were needed that took into account the particular content and context of the resource and related content. Based on the lessons from the pilot study, a study was performed focused on two research questions: (1) how particular relationships between the content of pages effect expectations and (2) how encountered implementations of saving and perceptions of content value relate to the notion of internet resource bounds. Results showed that human expectations are affected by expected relationships, such as two web pages showing parts of the same news article. They are also affected when two content elements are part of the same set of content, as is the case when two photos are presented as members of the same collection or presentation. Expectations were also affected by the role of the content – advertisements presented alongside articles or photos were less likely to be considered as part of a resource. The exploration of web resource boundaries found that people’s assessments of resource bounds rely on understanding relationships between content fragments on the same web page and between content fragments on different web pages. These results were in the context of personal archiving scenarios. Would institutional archives have different expectations? A follow-on study gathered perceptions in the context of institutional archiving questions to explore whether such perceptions change based on whether the archive is for personal use or is institutional in nature. Results show that there are similar expectations for preserving continuations of the main content in personal and institutional archiving scenarios. Institutional archives are more likely to be expected to preserve the context of the main content, such as additional linked content, advertisements, and author information. This implies alternative resource bounds based on the type of content, relationships between content elements, and the type of archive in consideration. Based on the predictive features that gathered, an automatic classification for determining if two pieces of content should be considered as part of the same resource was designed. This classifier is an example of taking into account the features identified as important in the studies of human perceptions when developing techniques that bound materials captured during the archiving of online resources.en
dc.format.mimetypeapplication/pdf
dc.language.isoen
dc.subjectWeb preservationen
dc.subjectHCIen
dc.titleIdentifying the Bounds of an Internet Resourceen
dc.typeThesisen
thesis.degree.departmentComputer Science and Engineeringen
thesis.degree.disciplineComputer Scienceen
thesis.degree.grantorTexas A & M Universityen
thesis.degree.nameDoctor of Philosophyen
thesis.degree.levelDoctoralen
dc.contributor.committeeMemberFuruta, Richard
dc.contributor.committeeMemberCaverlee, James
dc.contributor.committeeMemberMcNamara, Ann
dc.type.materialtexten
dc.date.updated2019-01-23T17:10:00Z
local.etdauthor.orcid0000-0003-3862-740X


Files in this item

Thumbnail

This item appears in the following Collection(s)

Show simple item record