Domain Characteristics
antithetical experience to searching; users are not required to have a good idea of what they
are looking for before engaging in the browsing experience. Engineering this sort of
serendipitous discovery relies upon providing a highly relevant experience to users in much
the same way that search depends upon relevance. One way that libraries have traditionally
attempted to tackle the issue of relevance is through classification via the Library of
Congress system, among others, and the use of metadata about the materials. A sort of
pseudo-browsing experience is often enabled in library search systems whereby users are
able to engage in faceted browsing once an initial search has been made. This functions
much like many popular and highly usable online shopping sites whereby results can be
narrowed and expanded by adding or removing categories or other metadata descriptors.
While this is a proven method for interfacing with search result listings, it does not address
the scenario in which the user does not know what to search for initially, and it does not
utilize the expertise embedded in the collective “mindspace” of the university's librarians and
faculty.
One method that deserves exploration for creating a truer browsing experience and utilizing
the university's expertise is a recommendation framework that generates
highly relevant resources for users. One activity that both professors and librarians are
already doing is creating lists of resources tailored to specific classes, topic areas, or
disciplines. Professors do this every time they develop a reading list for a particular course
whereas librarians do this when working with specific courses to help students with their
research, when creating subject guides, and when curating targeted physical collections.
Each of these lists is carefully curated by professionals who are experts in
their domain. As such, an item that appears in one of these lists is distinguished from any
item in the libraries' collections that does not appear in any of them. Furthermore, an
item that appears in more than one list is distinguished from an item that occurs in only one.
Due to this, an occurrence of an item in one of these lists can be thought of as a weighted
vote for that item. Using this frame of thought, a recommendation system can be developed
to take into account these weighted votes. The resulting recommendations can be presented
to users in an interface tailored to browsing and focusing on connecting highly relevant
materials to each other.
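The weighted-vote idea can be sketched in a few lines of Python. This is only an illustration: the list names, item identifiers, and per-list weights are hypothetical, and the text does not specify how the weights would be chosen.

```python
from collections import defaultdict

# Hypothetical curated lists: name -> (weight, items). The weights are an
# assumption for illustration; the text does not prescribe how to set them.
curated_lists = {
    "course_reading_list": (1.0, ["book_a", "paper_b", "book_c"]),
    "subject_guide": (0.5, ["paper_b", "book_d"]),
    "physical_collection": (0.5, ["book_a", "paper_b"]),
}


def tally_votes(lists):
    """Sum one weighted vote per appearance of an item in a curated list."""
    votes = defaultdict(float)
    for weight, items in lists.values():
        for item in set(items):  # an item votes at most once per list
            votes[item] += weight
    return dict(votes)


votes = tally_votes(curated_lists)
# Highest vote total first; ties broken alphabetically.
ranked = sorted(votes.items(), key=lambda pair: (-pair[1], pair[0]))
```

Items that never appear in a list receive no entry at all, which mirrors the distinction drawn above between listed and unlisted items.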
Academic materials in the form of books and research papers have specific
characteristics that make working with them different from working with popular books,
commercial products, or other more ordinary recommender domains such as
movies or restaurants. The most salient of these differences is the sheer quantity of
academic resources available. When attempting to find materials on a broad topic, users will
most often run into problems of scale: a particular domain contains multiple orders of
magnitude more resources than the user needs or can process.
In addition, while almost all of these materials have metadata, the metadata is often less
useful than in more popular and commercial areas.
The ineffectiveness of metadata has a number of causes, but one of
them is the highly specialized language used to describe resources in different subject areas.
Over time, academia has developed an ever-broadening set of fields and disciplines as
scientific methodology has necessitated ever-increasing levels of specialization. This creates
a branching effect, which allows for specialization but can also tend to separate disciplines
Kevin Champion of Team Serendipity 3
from each other. Since each discipline develops its own language to describe itself, highly
related disciplines that end up on different branches sometimes lose connection to each
other. However, reality dictates that a more appropriate metaphor is a web and that even
though academia has created branches that effectively put fields into their own silos, they
often share many characteristics and ideas in common. One way of connecting these
branches once more is to build a web by mapping resources. However, instead of using
metadata to do this, there is an opportunity to use co-occurrence in lists to draw
connections between distinct resources. This idea is not dissimilar to the groundbreaking
“PageRank” algorithm developed by Google to weight webpages by the links they receive
from other webpages. In fact, many of the characteristics specific to academic resources are
also found in webpages (scale, ineffective metadata). As a result, if we
think of Google's results as an explicit form of recommendation engine, it is not a stretch to
see the applicability of a recommender system to this idea of linking together academic
resources.
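As a rough sketch of how co-occurrence in lists could draw connections between resources, the following Python counts how many lists each pair of resources shares. The list names and resource identifiers are hypothetical, and this is plain pair counting rather than an implementation of PageRank.

```python
from collections import Counter
from itertools import combinations

# Hypothetical reading lists from different disciplines that share resources.
lists = {
    "psych_101": ["cognition_text", "memory_paper", "stats_primer"],
    "econ_210": ["game_theory_text", "stats_primer"],
    "neuro_305": ["cognition_text", "memory_paper"],
}


def co_occurrence(lists):
    """Count how many lists each unordered pair of resources shares."""
    pairs = Counter()
    for items in lists.values():
        for a, b in combinations(sorted(set(items)), 2):
            pairs[(a, b)] += 1
    return pairs


def neighbors(pairs, item):
    """Resources linked to `item`, strongest link first."""
    linked = Counter()
    for (a, b), count in pairs.items():
        if a == item:
            linked[b] += count
        elif b == item:
            linked[a] += count
    return linked.most_common()


pairs = co_occurrence(lists)
```

In this toy data, `stats_primer` links a psychology list to an economics list, which is the kind of cross-branch connection the web metaphor describes.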
Design Dimensions
Note on privacy
Public and academic libraries place supreme importance on the privacy of their users.
As a result, they do a number of things to ensure that users maintain privacy when using
library resources and services. One of the most important mechanisms that libraries employ
for maintaining privacy is to simply not track and store user behavior. What this means is
that libraries intentionally purge information about what resources a particular user has
checked out or viewed in the past. This makes it so that libraries are not put in an ethically
compromised position if the government comes to them asking for information about a
particular user; if they do not have any information they do not have to break this user's
privacy.
This policy has implications for developing user interfaces and experiences. Since the
libraries do not store this information, they cannot create an interface that allows a user to
log in to her account and view her previous checkout history, for instance. It also has
implications for the type of recommendation systems that can be built for library materials.
Since user data is intentionally purged, user-user algorithms and collaborative filtering
techniques that depend on user histories are not possible; we have to presume an environment
where we do not have user-specific information. Users are unable to rate resources and the
library does not have a way of tying browsing behavior to an individual user over time. That
said, some libraries are developing opt-in systems to enable some of these features if users
agree to the privacy implications. Nonetheless, this paper attempts to outline a system that
can be effective using item-item algorithms and content filtering approaches in the university
context without the need for user-specific data.
Content-based recommending
One way of building a recommender for library resources is to utilize the reading and
resource lists created by professors and librarians. In terms of the recommender, each list
can be considered similarly to a user and each item in a list can be considered a vote for that
item. Since these lists are curated by domain experts, we can operate with confidence that
by considering items' existence in a list as a vote the recommender will be working with
resources that have both a high degree of quality and relevance. Therefore, this technique
will result in a matrix of lists and items where the lists are along one axis and the items are
along the other. Using this matrix, a simple item-item recommender algorithm can be used
to recommend related resources to the current resource being viewed. Additionally, since
there will be a much larger number of resources than there will be lists and since there will
be a relatively small amount of overlap from resources that are listed more than once (i.e.,
resources that have more than one vote), this matrix will work best with algorithms that deal
well with sparse ratings matrices. Consequently, an SVD algorithm can be used here to
discover features of the items based on their votes and make recommendations based on
these features.
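A minimal sketch of this pipeline, assuming NumPy is available: build the binary list-by-item vote matrix, take a truncated SVD, and rank items by cosine similarity in the resulting feature space. The lists, the item identifiers, the function names, and the choice of two features are all illustrative assumptions for a toy matrix.

```python
import numpy as np

# Hypothetical lists; each row of the matrix is a list, each column an item.
lists = {
    "list_1": ["a", "b", "c"],
    "list_2": ["a", "b", "c"],
    "list_3": ["d", "e"],
    "list_4": ["d", "e"],
}
items = sorted({item for members in lists.values() for item in members})
col = {item: j for j, item in enumerate(items)}

# Binary vote matrix: 1.0 where the list contains the item.
votes = np.zeros((len(lists), len(items)))
for i, members in enumerate(lists.values()):
    for item in members:
        votes[i, col[item]] = 1.0


def item_features(matrix, k):
    """Rank-k item vectors from a truncated SVD of the vote matrix."""
    _, s, vt = np.linalg.svd(matrix, full_matrices=False)
    return vt[:k].T * s[:k]  # one row per item, k latent features


def recommend(features, item, n=2):
    """Items ranked by cosine similarity to `item` in the latent space."""
    norms = np.linalg.norm(features, axis=1)
    norms[norms == 0] = 1.0  # avoid dividing by zero for empty columns
    unit = features / norms[:, None]
    scores = unit @ unit[col[item]]
    order = [j for j in np.argsort(-scores) if j != col[item]]
    return [items[j] for j in order[:n]]


features = item_features(votes, k=2)
```

Calling `recommend(features, "a")` on this toy data returns the two items co-listed with `a`. A real vote matrix would be far larger and sparser, so a sparse SVD routine would likely be a better fit than the dense call used here.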
Along with this way of counting instances of resources in lists as a vote, other
content-based techniques can be used in other algorithms to develop interesting
recommendations. There are a number of useful pieces of metadata that each resource is
likely to have. Most of these forms of metadata will be useful for recommending similar
items, but the following will be most effective: list metadata that details the course the list is
used for and the subject area/s the list items deal with, item subject data derived from the
Library of Congress subject headings, full-text descriptions of items derived from abstracts
and summary paragraphs, and additional subjects or “tags” applied to the items by the
professors and librarians when they list them. Of these four types of metadata, three of them
are essentially keywords that are used to categorize the items into some sort of taxonomy.
As such, these can all be grouped together and used to create a content keyword frequency
matrix, which can then be fed into a content-filtering SVD algorithm. The fourth type,
full-text descriptions, can be used by first running them through a natural language
processing step in order to derive keywords from the descriptions. However, the
resultant keywords should probably not be combined with the categorical keywords
contained in the other metadata because they come from an uncontrolled vocabulary and
are not used for taxonomical purposes. All of the other metadata come from controlled
vocabularies and will thus result in a more concentrated frequency matrix. Adding the natural
language keywords would dilute this concentration, rendering this keyword matrix less
effective. Instead, the abstract keywords can form their own content keyword frequency
matrix, which can be fed into another SVD algorithm to generate similarities.
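The separation between controlled-vocabulary keywords and abstract-derived keywords can be sketched as two frequency matrices built by the same routine. The resources, field names, and keywords below are hypothetical, and the abstract keywords are assumed to have already been extracted by some natural language processing step.

```python
from collections import Counter

# Hypothetical metadata for two resources. The controlled fields (list
# metadata, subject headings, tags) are kept apart from abstract keywords.
resources = {
    "book_a": {
        "controlled": ["Cognition", "Memory", "Psychology", "Memory"],
        "abstract_keywords": ["recall", "experiment", "participants"],
    },
    "paper_b": {
        "controlled": ["Memory", "Neuroscience"],
        "abstract_keywords": ["hippocampus", "recall"],
    },
}


def frequency_matrix(resources, field):
    """Return (vocabulary, rows) where rows[r][j] counts vocabulary[j] in r."""
    vocabulary = sorted(
        {kw for meta in resources.values() for kw in meta[field]}
    )
    rows = {}
    for name, meta in resources.items():
        counts = Counter(meta[field])
        rows[name] = [counts[kw] for kw in vocabulary]
    return vocabulary, rows


# Two separate matrices, so uncontrolled terms never mix with controlled ones.
controlled_vocab, controlled_rows = frequency_matrix(resources, "controlled")
abstract_vocab, abstract_rows = frequency_matrix(resources, "abstract_keywords")
```

Each matrix could then be passed to its own SVD, as described above.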
It must also be mentioned that by thinking of lists as users and the existence of items
in lists as votes for those items, it is possible to utilize a pseudo-user-user algorithm. In this
case the user-user algorithm might be better described as a list-list algorithm. By running the
vote matrix into a list-list algorithm, it would be possible to generate recommendations of
other lists similar to the current list being viewed. The simple user-user algorithm would
discover these similar lists by finding lists that had the greatest number of resources co-
listed in each. Since we have already discussed that this sort of overlap will be sparse (even
though it will exist), an SVD algorithm would be more effective in this instance as well
because it could discover relationships in a more relationally complex way, which will
hopefully result in more relevant recommendations of similar lists.
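The simple list-list approach, which counts co-listed resources, might look like the following sketch. The list names and contents are hypothetical, and this shows only the overlap-counting baseline rather than the SVD variant.

```python
# Hypothetical lists represented as sets of resource identifiers.
lists = {
    "psych_101": {"cognition_text", "memory_paper", "stats_primer"},
    "neuro_305": {"cognition_text", "memory_paper", "imaging_paper"},
    "econ_210": {"game_theory_text", "stats_primer"},
}


def similar_lists(lists, current):
    """Rank other lists by the number of resources shared with `current`."""
    target = lists[current]
    overlaps = [
        (name, len(target & members))
        for name, members in lists.items()
        if name != current
    ]
    # Drop lists with nothing in common; sort by overlap, then by name.
    shared = [pair for pair in overlaps if pair[1] > 0]
    return sorted(shared, key=lambda pair: (-pair[1], pair[0]))
```

With this data, `similar_lists(lists, "econ_210")` finds only `psych_101`, because `neuro_305` shares no resource with it directly. That kind of miss under sparse overlap is the motivation given above for preferring SVD.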
Design recommendations
In order to create the most useful recommendation system, I recommend the use of
most of the algorithmic content-based techniques mentioned above. In this vein, I think a
hybrid system should be used to help surface related resources to users. When a user views
a particular item in a particular list, the recommendation system will employ a number of
techniques to feed recommendations into the interface.
The matrix of “votes” created from resources existing in lists will be fed into an SVD
algorithm using ten features (in the optimization process the number of features will be
tweaked to get optimal results). The content keyword frequency matrix derived from subject
classifications will also be fed into an SVD algorithm. Using the output features and
weightings of each SVD calculation, a weighted feature combination technique will be
employed to join the features of the two content-based SVDs. This combinatorial approach
is a hybrid itself of the “weighted” (Burke, 2002, p. 339) and “feature combination” (Burke,
2002, p. 341) hybridization techniques and will work by first artificially inflating the weightings
of the list-based SVD to give its results primacy and then will combine the features so that
one set of recommendations is output. This approach of combining the SVDs will allow the
resources' existence in lists to reveal relationships, but will also utilize the inherent
taxonomic connections between resources that have been described by classification
experts.
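One possible reading of this weighted feature combination is sketched below: scale each item's list-based features by an inflation factor, append its subject-based features, and compare items in the combined space. The feature values, the `LIST_WEIGHT` constant, and the use of cosine similarity are assumptions made for illustration, not details given in the text.

```python
import math

# Hypothetical per-item feature vectors, standing in for the outputs of the
# list-based SVD and the subject-keyword SVD (two features each here).
list_features = {"a": [1.0, 0.0], "b": [1.0, 0.0], "c": [0.0, 1.0]}
subject_features = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 0.0]}

LIST_WEIGHT = 2.0  # assumed inflation factor giving list votes primacy


def combine(list_vec, subject_vec, weight=LIST_WEIGHT):
    """Scale the list-based features, then append the subject features."""
    return [weight * x for x in list_vec] + list(subject_vec)


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


combined = {
    item: combine(list_features[item], subject_features[item])
    for item in list_features
}


def recommend(item):
    """Other items ordered by similarity in the combined feature space."""
    others = [other for other in combined if other != item]
    return sorted(others, key=lambda o: -cosine(combined[item], combined[o]))
```

In this toy data, item `b` shares list features with `a` while item `c` shares subject features with `a`. With equal weights the two would tie; inflating the list features ranks `b` first, which is the primacy the design calls for.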
In addition to these two algorithms, a third will be run on the content keyword
frequency matrix created from the full-text abstracts of each resource. This matrix will be fed
into its own SVD and the weightings and features from it will be combined with the hybrid
results which have already been combined. It will do this using one of two techniques
depending on the interface to be employed: weighted combination or “mixed” combination
(Burke, 2002, p. 341). If the desired interface requires only one set of recommendations then
a weighted combination will occur whereby the full-text recommendations will be weighted
as less important than the combined vote and subject keyword based recommendations.
This is the case because the full-text derived recommendations do not take into account the
professors' and librarians' expertise, which is a key element missing in current systems that
this paper proposes will lead to better recommendations. In interfaces which can
accommodate a more complex display, the full-text recommendations will be displayed in a
separate location alongside the other recommendations using a “mixed” strategy.
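The two presentation strategies can be contrasted in a short sketch. The scores, the `FULLTEXT_WEIGHT` value, and the function names are hypothetical; the only constraint taken from the text is that full-text recommendations count for less than the expert-derived ones.

```python
# Hypothetical similarity scores for the item being viewed.
expert_scores = {"b": 0.9, "c": 0.4, "d": 0.1}    # vote + subject hybrid
fulltext_scores = {"c": 0.8, "d": 0.7, "e": 0.6}  # abstract keywords

FULLTEXT_WEIGHT = 0.3  # assumed; must stay below the expert weight of 1.0


def weighted(expert, fulltext, w=FULLTEXT_WEIGHT, n=3):
    """Single ranked list: expert score plus down-weighted full-text score."""
    items = set(expert) | set(fulltext)
    total = {i: expert.get(i, 0.0) + w * fulltext.get(i, 0.0) for i in items}
    return sorted(total, key=lambda i: (-total[i], i))[:n]


def mixed(expert, fulltext, n=2):
    """Two separate ranked lists, one per source, shown side by side."""
    def top(scores):
        return sorted(scores, key=lambda i: (-scores[i], i))[:n]
    return {"expert": top(expert), "fulltext": top(fulltext)}
```

An interface with room for only one recommendation panel would call `weighted`, while a richer layout would call `mixed` and render each list in its own location.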
Lastly, a further type of recommendation will be used to recommend other lists similar to
the current one being viewed. For this set of recommendations the original matrix of
resources and lists will be sent to an SVD to discover features of the lists, which will lead to
recommendations for other lists. These recommendations will be displayed in the interface
apart from all the other recommendations as they are distinct.
Performance
Performance in this system is not likely to be a problem because almost all of the
computation can happen offline before recommendations are needed. Since this system
does not have to deal directly with user input via ratings or other usage metrics, it has little
need to update in real time. The main event that would require rerunning the algorithms is
a list being added to or edited in the system. Since this will happen only semi-frequently, most
computation can happen offline without negatively impacting the user experience.
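The offline/online split might be organized as in the sketch below, where a rebuild runs only when a list is saved and each request is a plain lookup. All names are hypothetical, and simple co-listing stands in for the SVD pipeline to keep the example short.

```python
# Hypothetical in-memory store; a real system would persist this table.
lists = {"list_1": {"a", "b"}, "list_2": {"b", "c"}}
recommendation_table = {}


def rebuild_table():
    """Offline step: recompute related items for every resource."""
    table = {}
    for members in lists.values():
        for item in members:
            table.setdefault(item, set()).update(members - {item})
    recommendation_table.clear()
    recommendation_table.update({k: sorted(v) for k, v in table.items()})


def save_list(name, members):
    """Adding or editing a list is the only event that triggers a rebuild."""
    lists[name] = set(members)
    rebuild_table()


def related(item):
    """Online step: a dictionary lookup, with no computation per request."""
    return recommendation_table.get(item, [])


rebuild_table()
```

Because `related` never touches the underlying matrices, the cost of the recommendation algorithms is paid only when `save_list` runs.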
Interface
has been listed in. This type of element has been pioneered in the commercial sector with
features like Amazon's Listmania. There, shoppers create lists of products; when a shopper
navigates to a product that has been listed, the product page includes an element that displays
the lists it belongs to and related items from those lists.
Future possibilities
Along with the recommendations already mentioned, if this system were put in place
and were successful, there would be a lot of opportunity to expand the system by
Note on sources
This paper was developed in concert with a project I am doing to submit to the
iDesign competition for the University of Michigan Libraries. Due to this, much of the domain
specific information and knowledge within was ascertained from a series of interviews and
discussions with University of Michigan librarians and library staff, along with staff of the
Open.Michigan program. Also of note is that this paper outlines a number of aspects of the
actual design I will be submitting for the iDesign competition.
References
Burke, R. (2002). Hybrid recommender systems: Survey and experiments. User Modeling
and User-Adapted Interaction, 12(4), 331-370.