
Computer Science Department

Technical Report NWU-CS-05-08
June 6, 2005

RAIn: A System for Indexing and Archiving RSS Feeds
Jeff Cousens and Brian Dennis

Abstract
Really Simple Syndication, or RSS, provides a way for users to monitor a web site for changes. One of the most popular uses of RSS is to syndicate a web log. RAIn, for RSS Archiver and Indexer, is a system for monitoring and archiving RSS feeds, and for indexing their contents. This report provides a discussion of the design and implementation of RAIn. The report also includes a summary of RAIn's results over a two-week period, illustrating both how a small, low-end system is capable of monitoring a significant number of feeds and the types of statistics RAIn is capable of producing.

Keywords: Really Simple Syndication, RSS Feed Crawler, RSS Feed Statistics, RSS Feed Indexing, Python

Table of Contents
INTRODUCTION
OVERVIEW
DESIGNING A FEEDCRAWLER
    Fetchers
        The FeedFetcher
        The CurlFetcher and SharedCurlFetcher
    Feed Newness
    Archiving
    Indexing
    Querying
        The FeedSearcher
        The FeedAnalyzer
    Feed Discovery
    Storage
        Schema
    Design Decisions and Lessons Learned
EXPERIENCES USING RAIN
    Experimental Setup
    Analysis
CONCLUSIONS
REFERENCES

Introduction
The Internet is changing writing. Fifteen years ago, one had to convince a publisher to accept a manuscript before one could become a real author. Publications were limited to books, magazines and papers. Writing was something tangible, something that required effort and overhead to produce and distribute. Then came the World Wide Web. Anyone could create a home page. Companies like Tripod and Geocities enabled everyone to get web space for free and publish anything they wanted. There was no longer any editorial approval or need to sell copies.

Recently, interest in publishing on the Web has led to an explosion in the popularity of web logs, or blogs. Blogs make it easy to maintain repeatedly updated sites. People now are creating electronic diaries for the entire world to read. Where once everyone had a home page, now everyone has a blog. Some authors post infrequently and personally, while others take their blogs very seriously and professionally. In 2004, blogs played a significant role in the US Presidential Election. At both the Democratic and the Republican National Conventions, bloggers stood beside traditional journalists. Providing real-time coverage via wireless devices, these blogs were the main source of coverage of the conventions for many people [8].

The blogging phenomenon is still relatively young. While some sites exist to gather blog statistics, they often keep the information close, revealing only a very small subset of the information gathered. There are millions of blogs out there [14], and little is known about them. How do people use blogs? What are their posting habits? More interesting are the stories told by these blogs. What hot news topic is everyone discussing? When did they start? RAIn was created to help answer these questions.

RAIn is a software system capable of collecting information about hundreds of thousands of blogs, allowing us to examine the behavior of a large community of blogs. The system was designed to analyze the behavior of communities of feeds, not blogging on the whole. While capable of handling hundreds of thousands of blogs, it was not designed to compete with sites like Technorati [15] or Syndic8 [2] that attempt to perform exhaustive monitoring of every blog in the blogosphere (the world of web logs). The system is fairly lightweight, permitting it to be run on commodity hardware; modular, enabling components to be changed or extended; and flexible enough to be adapted to a wide variety of queries.

Overview
Really Simple Syndication, or RSS, is a lightweight, XML-based method for sharing web content. It provides a low-bandwidth way for users to watch a web site for changes. As blogging has exploded in popularity, so has RSS. All major blogging packages include support for syndication using RSS. RSS comes in two common versions: RSS 0.9x [10] and RSS 2.0.x [17]. Depending upon the implementation, an RSS feed may contain anything from a list of headlines with brief summaries to the full contents of a blog's articles. A blog's RSS 2.0.1 feed might look like:
<rss version="2.0">
  <channel>
    <title>Technology at Harvard Law</title>
    <link>http://blogs.law.harvard.edu/tech/</link>
    <description>Internet technology hosted by Berkman Center.</description>
    <pubDate>Tue, 04 Jan 2005 04:00:00 GMT</pubDate>
    <item>
      <title>RSS Usage Skyrockets in the U.S.</title>
      <link>http://blogs.law.harvard.edu/tech/2005/01/04#a821</link>
      <description>
        Six million Americans get news and information from RSS
        aggregators, according to a
        <a href="http://www.pewinternet.org/pdfs/PIP_blogging_data.pdf">nationwide
        telephone survey</a> conducted by the Pew Internet and American
        Life Project in November.
      </description>
      <dc:creator>Rogers Cadenhead</dc:creator>
    </item>
  </channel>
</rss>

Minimally, an RSS feed is a sequence of loosely structured items. RSS's ease of use and popularity has led to a syndication ecology, where readers monitor a site's RSS feed instead of the site itself. With RAIn, the goal was to create a system for monitoring, archiving and analyzing RSS feeds. The system is designed to be modular, permitting new components to be added or existing components to be swapped out. This is true both in the design of the objects and packages and in the way that RAIn leverages Python [16], the high-level, interpreted, object-oriented language RAIn was written in. The system is lightweight, requiring only inexpensive hardware to monitor hundreds of thousands of feeds on a daily basis.

Designing a FeedCrawler
RAIn is a modular system, consisting of an engine to manage feeds to be crawled and modular components to fetch the feeds, archive the results and index the contents, as well as components for answering queries and finding new feeds.

Figure 1: A diagram of RAIn's architecture

Figure 1 shows RAIn's architecture. The crawler determines which feeds need to be crawled and creates a fetch thunk: an executable object containing a fetcher, an archiver, an indexer and connections to the Internet and database. The fetch thunks are then put in a crawl pool. As threads become available, fetch thunks in the crawl pool are executed. The fetch thunk thread fetches the feed from the Internet and, if the feed has been updated, archives and indexes it.

Every feed monitored by RAIn is stored in a database table. This table includes a variety of information about the feed, including the URL to check, the time and result of the last check, a status count, the time of the next check and information about the last fetch of the feed, including the HTTP ETag and Last-Modified headers for determining feed newness, if available, and an MD5 digest of the feed. On a user-defined interval, RAIn checks to see if the database contains feeds that have yet to be checked or stale feeds. Feeds are considered stale when the next check time has passed. URLs of feeds to be checked are retrieved by the core module and dispatched to worker threads.
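A minimal sketch of this dispatch loop follows. The FetchThunk class, the fetcher/archiver/indexer interfaces and the db.stale_feeds helper are illustrative names rather than RAIn's actual classes; the interval, batch size and thread count are the values used in the experimental setup described later.

    import time
    import Queue
    import threading

    class FetchThunk:
        """Executable object bundling one feed with a fetcher, archiver and indexer."""
        def __init__(self, feed_row, fetcher, archiver, indexer):
            self.feed_row = feed_row
            self.fetcher = fetcher
            self.archiver = archiver
            self.indexer = indexer

        def __call__(self):
            # Fetch the feed; archive and index it only if it has changed.
            result = self.fetcher.fetch(self.feed_row)
            if result is not None and result.changed:
                self.archiver.archive(result)
                self.indexer.index(result)

    def crawl_loop(db, fetcher, archiver, indexer,
                   interval=60, batch=400, num_threads=30):
        pool = Queue.Queue()            # the crawl pool of pending fetch thunks

        def worker():
            while True:
                thunk = pool.get()      # block until a thunk is available
                thunk()                 # execute it on this worker thread

        for _ in range(num_threads):
            t = threading.Thread(target=worker)
            t.setDaemon(True)
            t.start()

        while True:
            # Select up to `batch` feeds whose next check time has passed.
            for row in db.stale_feeds(limit=batch):
                pool.put(FetchThunk(row, fetcher, archiver, indexer))
            time.sleep(interval)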

Fetchers
A Fetcher handles the HTTP retrieval and processing of a feed. It is responsible for determining whether the feed has changed, updating the status of the feed and the feed metadata, and archiving and indexing the feed, if necessary. Three different classes of Fetchers were created:

The FeedFetcher
The FeedFetcher is the most basic Fetcher module. In addition to a simple set of Fetcher routines, the FeedFetcher module also contains routines common to all Fetchers. It is written using only stock Python routines and uses Python's urllib2 module to retrieve the feeds. This allows the FeedFetcher module to be used on any system where Python is available, without any dependence on non-standard modules that might not be available on all platforms and versions. It contains a reasonable amount of intelligence to attempt to avoid overloading servers. However, urllib2 handles are not reusable, so every feed crawled requires a new handle to be instantiated.
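A reduced sketch of such a fetch using only the standard library is shown below; the conditional headers tie in to the newness checks described later, and the function name and user agent string are illustrative, not RAIn's actual code.

    import urllib2

    def fetch_with_urllib2(url, etag=None, last_modified=None):
        # urllib2 handles are not reusable, so a new request is built per fetch.
        req = urllib2.Request(url)
        req.add_header('User-Agent', 'RAIn FeedCrawler (sketch)')
        if etag:
            req.add_header('If-None-Match', etag)
        if last_modified:
            req.add_header('If-Modified-Since', last_modified)
        try:
            resp = urllib2.urlopen(req)
        except urllib2.HTTPError, e:
            if e.code == 304:
                return None              # server reports the feed as unchanged
            raise
        body = resp.read()
        info = resp.info()
        return body, info.getheader('ETag'), info.getheader('Last-Modified')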

The CurlFetcher and SharedCurlFetcher


The CurlFetcher is an enhanced Fetcher module. It uses pycurl [6], a Python wrapper for libcurl. libcurl is a highly optimized C library for network operations. It implements features like caching the results of Domain Name System (DNS) queries and reusable handles, allowing for improved performance over Python's urllib2. The CurlFetcher can be used in two different ways: per fetch (CurlFetcher) or per thread (SharedCurlFetcher). Per fetch, the CurlFetcher is similar to the FeedFetcher in that every feed crawled requires a new handle to be instantiated. Per thread, the SharedCurlFetcher creates one handle per thread when the FeedCrawler module is started. This saves the Fetcher the overhead of having to instantiate a new pycurl handle every time a new feed is fetched. It also allows the pycurl handle to cache DNS information across fetches, reducing network overhead.

The performance differences between the FeedFetcher, CurlFetcher and SharedCurlFetcher were only briefly examined. As might be expected, the SharedCurlFetcher had the best performance in terms of number of feeds fetched per minute. There was not an obvious winner between the FeedFetcher and the CurlFetcher.
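A sketch of the per-thread handle reuse idea; the class name and option choices are illustrative, and the real SharedCurlFetcher manages its handle through the FeedCrawler's thread pool.

    import pycurl
    import cStringIO

    class SharedCurlFetcherSketch:
        """One pycurl handle, created once per thread and reused for every fetch."""
        def __init__(self):
            self.curl = pycurl.Curl()
            # Let libcurl cache DNS lookups across fetches on this handle.
            self.curl.setopt(pycurl.DNS_CACHE_TIMEOUT, 300)
            self.curl.setopt(pycurl.FOLLOWLOCATION, 1)

        def fetch(self, url):
            buf = cStringIO.StringIO()
            self.curl.setopt(pycurl.URL, str(url))
            self.curl.setopt(pycurl.WRITEFUNCTION, buf.write)
            self.curl.perform()
            return self.curl.getinfo(pycurl.HTTP_CODE), buf.getvalue()

        def close(self):
            self.curl.close()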

Feed Newness
Once a feed is retrieved, all Fetchers check to see if the feed has changed using several different metrics. If the HTTP headers contain an ETag field and the ETag field matches the previous ETag, or if the header contains a Last-Modified field and the Last-Modified field matches the previous Last-Modified, the feed is considered unchanged. If the feed is not found to be unchanged by ETag or Last-Modified header, an MD5 digest is taken of the entire feed. This is compared against a previous MD5 digest. If the digests are different, the feed is considered updated. Updated feeds are archived, both in their raw format for potential future analysis and as individual items.
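Condensed, the decision looks roughly like the following; the md5 module is shown for the digest, and the previous values are assumed to come from the feed's database row.

    import md5

    def feed_is_unchanged(headers, body, prev_etag, prev_modified, prev_digest):
        etag = headers.getheader('ETag')
        modified = headers.getheader('Last-Modified')
        # A matching ETag or Last-Modified header means the feed is unchanged.
        if etag and etag == prev_etag:
            return True
        if modified and modified == prev_modified:
            return True
        # Otherwise fall back to a digest of the entire document.
        return md5.new(body).hexdigest() == prev_digest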

Frequency of fetching is adaptive, in an attempt to match the feed's change frequency. The next fetch time is determined by adding a fetch interval to the time the current fetch was performed. All feeds begin with an interval of 1 hour, which is then modified based upon the result of the current fetch. If the feed has changed since the last check, the fetch interval is reduced by 2 hours, to a minimum of 1 hour. If the feed was unchanged, or there was an error fetching the feed, the fetch interval is increased by 4 hours, to a maximum of 24 hours. The number of consecutive errors encountered when fetching a feed is recorded; after too many consecutive failures, feeds are marked as removed and no longer checked. Adapting to a feed's frequency of change helps RAIn discover new items as they are posted without overloading a server by repeatedly checking for new items when a feed is unchanged. A rudimentary analysis showed that adaptive fetching worked: the fetch interval grew larger for infrequently updated feeds while it stayed small for frequently updated feeds. However, a more in-depth analysis over a longer time period would be necessary to determine how effective RAIn's current implementation is.
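The schedule update itself reduces to a few lines; a sketch using the parameters described above:

    MIN_INTERVAL = 1    # hours
    MAX_INTERVAL = 24   # hours

    def next_interval(current, changed):
        """Return the new fetch interval, in hours, after one fetch attempt."""
        if changed:
            # The feed changed: check it more often, down to once an hour.
            return max(MIN_INTERVAL, current - 2)
        # Unchanged or errored: back off, up to once a day.
        return min(MAX_INTERVAL, current + 4)

    # A frequently updated feed converges to hourly checks:
    #   next_interval(5, True) == 3, next_interval(3, True) == 1
    # while a quiet feed backs off: 1 -> 5 -> 9 -> ... -> 24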

Archiving
When a feed is determined to be new, the raw feed is archived in a database table. The feed is compressed using Python's zlib module. This is important, as feeds, being text, compress to somewhere between 5 and 10% of their original size. The compressed feed is then inserted into a binary database field, along with the date that the feed was archived. This raw feed can then be retrieved and uncompressed for later analysis.

The feed is also parsed, using Mark Pilgrim's Universal Feed Parser [12], into the individual entries in the feed. The items are then individually checked against the database using an MD5 digest to filter out items that have already been processed. New items are serialized using Python's pickle module, compressed using Python's zlib module and stored in a binary database field. These items can be retrieved, uncompressed and unserialized to access the raw contents of the item, including its full text. Information about the items contained within the feed is stored, including the URL of the item, the time that the item was posted, if available, the time that the item was archived and an MD5 digest of the item for later comparison. This information can be used to look for items from a certain site or in a certain date range. It can also be used to retrieve the item referenced in the feed from the feed's web site.
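A sketch of this pipeline, assuming a db object with simple insert and lookup helpers; the helper names, and hashing the pickled entry to form the item digest, are illustrative choices rather than RAIn's exact implementation.

    import md5
    import time
    import zlib
    import pickle
    import feedparser                    # Mark Pilgrim's Universal Feed Parser

    def archive_feed(db, feed_id, raw_feed):
        now = time.strftime('%Y-%m-%d %H:%M:%S')
        # 1. Archive the raw feed, compressed, for later re-analysis.
        db.insert_archive(feed_id, zlib.compress(raw_feed), now)
        # 2. Parse the feed into its individual items.
        parsed = feedparser.parse(raw_feed)
        for entry in parsed.entries:
            serialized = pickle.dumps(entry)
            digest = md5.new(serialized).hexdigest()
            # 3. Skip items whose digest has already been stored for this feed.
            if db.item_exists(feed_id, digest):
                continue
            db.insert_item(feed_id,
                           entry.get('link'),       # URL of the item
                           entry.get('date'),       # posting time, if available
                           now,                     # time the item was archived
                           digest,
                           zlib.compress(serialized))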

Indexing
In order to facilitate querying and analyzing the content of the feeds, the words in the feeds are indexed. This is a daunting task. Assuming the crawler is able to retrieve 200,000 feeds per day, that only 50% of these feeds contain new items and that the average item contains 20 words, the crawler will store 2,000,000 words per day. At that pace, in less than two months the crawler will have indexed more words than are contained in the British National Corpus. In practice, the numbers are significantly higher.

The design for indexing is based upon the method used by the WordHoard project at Northwestern University [11]. One aspect of WordHoard is an interface that allows literary scholars to search a corpus of works for words and to generate statistics, including frequency and counts. This is interesting to literary scholars as it can reveal patterns in an author's works as well as trends in literature on the whole. In WordHoard, works of literature are parsed into individual words. The words are stored in two different tables: a table of individual lemmas, for linguistic analysis across a corpus, and a table of word occurrences, for analyzing specific instances of a word's usage. The word occurrence table stores complete information about every word and punctuation mark in every work in a corpus and, with some meta-information such as speakers and act/scene or page, may be used to reconstruct the entire work, word for word.

Using WordHoard as a model, RAIn splits an RSS feed into individual items, then parses each item into individual words. Here RAIn departs slightly from WordHoard's word occurrence model. With RAIn retrieving more than 2,000,000 new words per day, storing complete word occurrence information for any significant time interval would consume an unreasonable amount of storage. As a compromise, sentences are stripped of punctuation and filtered against a list of common stop words. The remaining words are then aggregated within an item, and the distinct filtered words and their counts are stored in a database.

While words can tell a story with what they say, URLs define relationships. They show how posts relate to other posts and how sites relate to other sites. Special attention is paid to URLs in order to track these relationships. When indexing, RAIn looks for common URL patterns and stores them in a separate table. For this purpose, a URL is considered to be either a string containing the pattern http:// or the contents of an HTML anchor element's href attribute.

Using the information in these tables, many different types of queries are possible. One family of queries provides general statistical information: e.g., how long is the average item, or what is the average ratio of URLs to words? Blogging is still young and not much is known about the posting habits of bloggers. Are more people verbose but infrequent posters or brief but frequent posters? Does time of day impact posting? How about day of week? These statistics begin to paint a picture. With the right constraints, these statistics can even tell a story. For example, by monitoring a collection of political blogs before, during and after a keystone event (e.g., a party's national convention or the State of the Union), one might be able to tell whether the event was motivating, discouraging, or even mostly ignored.

More interesting are informational queries: e.g., how did the frequency of a word change over a given period of time, or what are the most commonly used words? A political scientist might wonder how the usage of Schiavo¹ changed during the period from February through April of 2005. When did people start posting about her? When did they stop? A linguist might wonder what the most commonly used words are, and how this changes over time. Where 18th century activists wrote books, many 21st century activists write blogs. Instead of being stored on paper, the snapshot of society today's authors provide us with is online.

Many blogs now make RSS feeds available for comments as well. By monitoring comments, one can gauge reader response to certain topics. Which posts engendered the most comments? Which were largely ignored?
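A sketch of the per-item indexing step described above; the stop word list is truncated and the tokenization rules shown are illustrative, not RAIn's exact rules.

    import re

    STOP_WORDS = ['the', 'a', 'an', 'and', 'or', 'of', 'to', 'in', 'is', 'it']  # truncated

    URL_RE = re.compile(r'href\s*=\s*["\']([^"\']+)["\']|(http://\S+)', re.IGNORECASE)
    WORD_RE = re.compile(r"[A-Za-z0-9']+")

    def index_item(text):
        """Return (distinct word counts, URLs) for the body of one item."""
        # URLs: either an explicit http:// string or the href of an anchor element.
        urls = [href or bare for href, bare in URL_RE.findall(text)]
        # Words: strip punctuation, drop stop words, aggregate counts within the item.
        counts = {}
        for word in WORD_RE.findall(text):
            word = word.lower()
            if word in STOP_WORDS:
                continue
            counts[word] = counts.get(word, 0) + 1
        return counts, urls

    # counts, urls = index_item('Six million Americans, according to a '
    #                           '<a href="http://www.pewinternet.org/">survey</a> ...')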

Querying
In order to facilitate analysis, a few different Python modules were created. These modules use two different approaches. The first is to provide a very generalized module, which can accommodate a wide variety of queries based upon RAIn's indexing. The second is to provide a very specific module, only capable of providing a focused set of information but able to generate it in an optimized way.

The FeedSearcher
The FeedSearcher module provides a generalized interface to the RAIn database. It allows someone to retrieve a list of items, words or URLs based on a series of constraints, including date, word count, a set of feeds to search and a pattern, either an exact pattern, a substring or a regular expression. It even allows someone to limit the number of results, to specify an offset and to get more information about a result, tying a word or URL to an item, and an item to a feed. The following is an example of using the FeedSearcher to find the number of updates per day for a single day:
    import FeedSearcher

    fs = FeedSearcher.FeedSearcher('localhost', 'db', 'user', 'pass')
    fs.type = 'entries'
    fs.start_date = '2005-03-15'
    fs.end_date = '2005-03-15'

    results = {}
    for item in fs.execute():
        for details in fs.getDetails('entries', item['feed_id']):
            if results.has_key(details['feed_url']):
                results[details['feed_url']] += 1
            else:
                results[details['feed_url']] = 1

¹ Terri Schiavo was a Florida woman in a persistent vegetative state whose right-to-life vs. right-to-die case became national news in March 2005 [4].

This generates a Python hash table, or dictionary, using the feed URLs as keys with the number of times the feed was updated that day as values. Yet in order to achieve the flexibility of the FeedSearcher, the interface is generic. The FeedSearcher's execute method does not return enough information, so the getDetails method must be invoked on every item returned by execute. Each call to getDetails involves a SQL query. Thus, in order to compute this statistic, the total number of SQL queries involved is the number of items plus one. For any large data set or complex query, a more optimized interface is desired.

The FeedAnalyzer
The FeedAnalyzer provides a very specific interface to the RAIn database. It is only capable of performing a fixed set of queries, but it performs them much more efficiently than the FeedSearcher could. The FeedAnalyzer was written for this report, and generated the statistics presented in the empirical study. It was designed from the top down, first looking at the information to present, then determining what queries were necessary to generate that information. The queries were optimized for the task, which provided faster query execution times and reduced the overall number of queries, while the results were returned in the format required for importing into Excel for analysis. The following is an example of using the FeedAnalyzer to find the number of updates per day for a range of days:
    import FeedAnalyzer
    import mx.DateTime

    fa = FeedAnalyzer.FeedAnalyzer('localhost', 'db', 'user', 'pass',
        mx.DateTime.ISO.ParseDateTime('2005-03-15'),
        mx.DateTime.ISO.ParseDateTime('2005-03-31'))
    result = fa.updatesPerDay()

As with the FeedSearcher, this also generates a Python dictionary using the feed URLs as keys with the number of times the feed was updated that day as values. However, the FeedAnalyzer generates this information using both fewer lines of Python and only one SQL query.

Feed Discovery
For some data sets, it is necessary to analyze a specific collection of feeds. However, sometimes all that is desired is a large collection of feeds. The FeedFinder module is designed to find new feeds to crawl. It is capable of visiting a blog tracking web site, such as blo.gs [18], and retrieving a list of RSS feed URLs. This list is then parsed by the FeedFinder and checked against the database (and itself) for duplicate feed URLs. New feeds are then added to the database for the FeedCrawler to crawl. This module was integrated into the FeedCrawler, although it may also be run independently.
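A sketch of that discovery step, assuming a listing page that exposes feed URLs in plain view and a db helper for the duplicate check; both are hypothetical simplifications of the real blo.gs and Syndic8 interfaces.

    import re
    import urllib2

    FEED_URL_RE = re.compile(r'http://\S+?\.(?:xml|rdf|rss)\b', re.IGNORECASE)

    def discover_feeds(db, listing_url):
        """Pull candidate RSS feed URLs from a tracking page and queue the new ones."""
        page = urllib2.urlopen(listing_url).read()
        seen = {}
        added = 0
        for url in FEED_URL_RE.findall(page):
            if url in seen:                  # duplicate within the listing itself
                continue
            seen[url] = 1
            if db.feed_exists(url):          # already monitored by the FeedCrawler
                continue
            db.insert_feed(url)              # picked up on the crawler's next pass
            added += 1
        return added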


Storage
RAIn currently stores all of its data in a relational database. It leverages Python's DB-API, allowing the database to be fairly easily swapped out. Modifications are necessary only when the schema changes due to differences in data types (e.g., PostgreSQL's bytea vs. MySQL's longblob) or to accommodate differences in modules' support of data types (e.g., pyPgSQL's PgSQL.PgBytea versus psycopg's Binary).

The database for RAIn is PostgreSQL [13]. Initially, pyPgSQL was used as the Python DB-API interface to PostgreSQL, and PostgreSQL performed well. However, as the database grew in size, performance fell off. At one point, RAIn was only processing thousands of feeds per day. To improve performance, the database interface module was switched from pyPgSQL to psycopg. While both modules provide a DB-API 2.0 compliant interface, psycopg was designed to be much faster than other modules. One important difference between pyPgSQL and psycopg is a bug with psycopg 1 and Unicode characters. This required a workaround to handle potential Unicode data. In addition, psycopg required more attention to be paid to transactions so updates would be available across all database handles.

Schema
RAIn's database consists of six tables:

webfeeds
This table contains information about the RSS feeds being monitored by RAIn. It is updated every time a feed is crawled with information about the fetch operation. If the feed has changed, http_etag, http_last_modified, fetch_last_attempt, fetch_next_attempt, fetch_status, fetch_status_count, fetch_interval and fetch_digest are updated. If the feed has not changed since the last fetch, only fetch_last_attempt, fetch_next_attempt, fetch_status, fetch_status_count and fetch_interval are updated, as the ETag, Last-Modified and MD5 digest will be unchanged.

webfeeds_archive
This table contains a zlib-compressed archive of every new RSS feed RAIn fetches, stored as a binary object (e.g., PostgreSQL's bytea, MySQL's longblob) in data_bytes. The date that the feed was archived is stored in data_archived. This allows the data to be analyzed using a different methodology at a later date.

webfeed_items
This table contains information about the individual items in the feeds fetched by RAIn.

webfeed_item_words
This table contains all of the words found by the FeedIndexer. The count for each word per item is stored in word_count. To facilitate querying, a ts_vector for the word is stored in index_word.


webfeed_item_urls
This table contains all of the URLs found by the FeedIndexer. To facilitate querying, a ts_vector for the URL is stored in index_url.

webfeed_bundles
This table contains information about bundles of RSS feeds, representing a many-to-one relationship between a group of feeds and a bundle name. This relationship allows groups of feeds to be tied together into a single bundle for analysis.

Figure 2: An entity-relationship diagram for RAIn's database


In addition to feed information stored in the database, RAIn also stores low-level information in log files. The logging level may be configured from critical, logging only events that would prevent RAIn from running, to debug, logging almost every operation RAIn performs.

Design Decisions and Lessons Learned


Storage: By itself, an RSS feed does not represent a significant amount of data; typically only a couple of KB. However, when processing 200,000 feeds per day, 75% of which are updated at least once a day, that couple of KB adds up very quickly. Space rapidly became a concern during development and we had to make several changes as a result. Most important was the implementation of accurate duplicate elimination. The initial design did not fully incorporate MD5 checksums. This was improved so that both the feed itself and the items within the feed are now checksummed. The feed is checked against the last feed to see if it has changed, while the item is checked against all items from that feed to ensure that it has not already been processed. Adding both of these checksums made a significant difference: feed checksums doubled the number of feeds marked as unchanged, while item checksums reduced the number of items processed by more than 75%. Not only did these changes directly correspond to savings in storage, but they also allowed significant increases in the number of feeds crawled per day. Even with these reductions, however, the amount of data stored is still very significant.

Indexing: While indexing is a very simple process to implement, it is very difficult to make it work well. The average post contains 350 distinct words. The architecture of the word index, while very useful for statistical analyses, requires that each of these words be its own record. This means that every feed crawled requires an average of 355 database inserts. This is a very significant amount of database I/O and comes at a non-negligible cost. Stop wording is one approach to reducing the amount of data. While initially a basic set of stop words was used, during analysis it was discovered that additional stop words are necessary. Depending upon the data set being analyzed, it may be necessary to analyze a few weeks' worth of data to get an adequate feel for which words are important and which are not. It is also important to consider the goal in using stop words. Some common stop words may actually be useful for answering certain questions; e.g., to analyze whether posts about men or women are more common, it would be desirable to have he/she and his/hers in the database.

The size of the database also impacts performance. At 350 words per item, and 100,000 new items per day, the words table would accumulate more than 1 billion words in less than a month. Even with stop words and duplicate elimination, the size of the words table is very substantial. This poses its own set of concerns when analyzing or updating the database. It places constraints on database performance, file system usage and even database design and index usage. As the database grows, queries take longer to process. Past a certain point, queries may no longer be performed interactively. Certain types of queries eventually become impossible. One solution to this problem is to change the database design. Currently, all of the words are stored in a single table. Switching to a design where a new table is created every day would limit the size of the individual tables, keeping search times interactive. Some data sets (e.g., most popular words) could be precomputed and the results stored in a separate, much smaller table. Another approach would be to store only frequency information in the database and not to maintain a full text index. The items could be stored as XML and indexed using a different mechanism, such as Lucene [1], Nutch [7] or XTF [5].

Database Design: Relational databases can provide a great deal of power and allow you to perform some very complicated queries very easily. However, as the size of the database increases, greater attention must be paid to the design of the database and how it impacts performance. The initial design of the stale feeds query used a not-equals constraint to filter removed feeds:
    SELECT feed_url, http_etag, http_last_modified, fetch_next_attempt,
           fetch_interval, fetch_digest
    FROM webfeeds
    WHERE now() >= fetch_next_attempt
      AND fetch_status != 16
    ORDER BY fetch_next_attempt
    LIMIT 200;

PostgreSQL executed this constraint as a sequence scan, linear in the number of records, despite the presence of an index on the feed status. This was not a problem with a small database, but, as the number of feeds monitored increased, so did the time it took for the stale feeds query to execute. This had an impact on the crawler's performance, as a significant amount of processor time and database I/O was lost waiting for this query to finish. To improve performance, the feed removal process was redesigned. Feeds continue to be marked as removed, but the next fetch time is set for 1,000 years in the future. This allows the use of a simple date filter, as it is unlikely that RAIn will be used with a current date of 3005. Since the next fetch field had previously been left untouched once a feed was removed, and thus served as a timestamp of when the feed was removed, a new field was added to record when a feed is removed.

Relational databases allow the creation of constraints on data, enforcing a schema and ensuring data integrity. This protection comes at a cost. The initial design included referential integrity constraints connecting the various tables, ensuring that a webfeed_item and webfeed_archive had a corresponding webfeed, and that a webfeed_item_url and webfeed_item_word had a corresponding webfeed item. At one point in development, it was necessary to perform some deletions from the database to remove duplicate feeds. With referential integrity constraints present, the database needed to perform complex joins and scans in order to process the deletion. Even with indexes in place, the queries took a significant amount of time. At that point, the decision to include referential integrity constraints was reevaluated and the constraints were removed, instead trusting RAIn to accurately insert data.

Data Format and Encoding: One problem when crawling a disparate set of web data is that it comes in a wide variety of formats. This makes it difficult to predict what data format a feed will use. Two places this caused problems were in the character encoding and the timestamp format. Most RSS feeds use ASCII, ISO-8859-1 or UTF-8 encoding, but some use other encodings. Similarly, most HTTP servers use a standard timestamp format that can be handled by eGenix's mxDateTime library [3], but some use uncommon formats not handled by mxDateTime; e.g., 2005-04-28T12:13:49Z. With feed finding enabled, it is likely that the database will eventually include some feeds using unhandled formats. As a result, careful attention must be paid to exception handling. On the data set RAIn was tested with, this happens about 0.005% of the time.
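A sketch of the defensive parsing this calls for; mx.DateTime's generic DateTimeFrom constructor is shown here with the fetch time as the fallback, while the real Fetchers wrap several such conversions.

    import logging
    import mx.DateTime

    def parse_timestamp(value, default=None):
        """Parse a feed or HTTP timestamp, falling back to the fetch time on failure."""
        if default is None:
            default = mx.DateTime.now()
        if not value:
            return default
        try:
            return mx.DateTime.DateTimeFrom(value)
        except (mx.DateTime.Error, ValueError):
            # e.g. '2005-04-28T12:13:49Z' from some servers; keep crawling anyway.
            logging.warning('unparseable timestamp %r, using fetch time instead', value)
            return default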

Experiences Using RAIn


A 16-day period at the end of March was selected to empirically investigate RAIn's capabilities. Several hand-selected groups of RSS feeds were added to the database and monitored, along with approximately 77,000 existing feeds. When the period ended, the data from those feeds was exported and analyzed for several different statistics.

Experimental Setup
For this experiment, RAIn was run on a dual Pentium III at 933 MHz with 1 GB RAM running Debian GNU/Linux testing (sarge) with the Debian versions of PostgreSQL (7.4.7-2), Python (2.3.5-1), psycopg (1.1.18-1), pycurl (7.13.0-1) and libcurl3 (7.13.1-2). The OS and crawler were stored on an internal Ultra-160 SCSI disk. The database was stored on an external Ultra Wide SCSI attached RAID5 ATA-100 disk array. The server was connected to the Internet via a 100 Mb Ethernet connection and configured to use a local installation of the DeleGate proxy server (8.9.6) with caching enabled.

In order for analysis to produce meaningful results, the feeds analyzed should be representative of something. To that end, six bundles of feeds were defined for analysis, totaling 723 feeds. These bundles are Computers & Technology (150 feeds), Entertainment (132 feeds), Eszter Politics (107 feeds), Politics (232 feeds), Sports (52 feeds) and Subscriptions A-list (50 feeds). Four of these (Computers & Technology, Entertainment, Politics and Sports) are the feeds from the Weblogs categories of Yahoo's Directory. Eszter Politics is a handpicked collection of political feeds from a colleague in Northwestern's School of Communication who is studying political blogging. Subscriptions A-list is the intersection of several different top feeds lists. This allows us to examine the behavior of specific blogging communities (e.g., the political community or the entertainment community) as well as to compare and contrast different communities.

The database was not purged before data gathering. In addition to the 723 bundled feeds, there were 77,373 feeds that were not part of any bundle. These feeds were obtained from several different sources, including the list of recently updated feeds on blo.gs and the list of syndicated feeds on Syndic8, two web sites providing centralized monitoring of over 10 million blogs. Some of the statistics analyzed are representative of the database as a whole, including both bundled and unbundled feeds. The unbundled feeds are not categorized in any way, other than having been on lists of blogs that were validated and updated recently. As such, they are representative of blogging on the whole and not any particular field. This is certainly a useful collection of feeds to consider. However, sites such as Technorati monitor 10 million blogs and the number of blogs in existence has been estimated to be over 50 million [14], so this is a very small slice of all blogs.

Analysis was performed while RAIn was still running, using the live database. For some queries, the live tables could be used. However, for others, especially those involving words, the live tables are too large to analyze in a reasonable amount of time. For the purpose of analysis, all of the data for the window being analyzed was exported into separate tables. This reduced the size of the words tables by more than 99%, making analyses possible in a reasonable amount of time. Even with these separate, smaller tables, some of the queries took more than 10 minutes to perform, while the queries to create these tables took several hours to complete.

For the duration of our data gathering, RAIn was checking for stale feeds every 60 seconds and looking for a maximum of 400 stale feeds. 30 threads were available in the thread pool. The hardware was capable of supporting a higher number of threads, and thus processing a larger number of stale feeds per minute, but the number of threads was intentionally kept low to facilitate simultaneous crawling and querying. RAIn was constantly busy on a feed set of 78,096 feeds, making approximately 235,000 feed visits per day.

Analysis
We selected March 15th through March 31st, 2005, as our window to analyze. Unfortunately, there was an unexplained glitch and the kernel killed the crawler early on March 23rd. The crawler was restarted on the 25th, but there was some information missed as a result of this failure, and certain aspects of the results are atypical. This has been indicated where it affects the results.


Figure 3: Feed status per day

Figure 4: Disk usage per day

Figure 3 shows the performance of the crawler in terms of how many feeds the crawler was able to visit each day, along with a breakdown of how the feeds were classified.

Figure 4 shows the size, in KB, by which the database grew each day storing the information about these feeds. It is important to note that these numbers pertain only to feeds, but database usage increases in proportion both to the number of updated feeds and to the number of new items in those feeds. The number of new items per feed was unusually high on March 25th due to the performance problems on the previous days, thus the disk-per-feed ratio on that day is not representative of typical usage.

These numbers were obtained from RAIn's logs and are representative of the performance of the system as a whole, including both bundled and unbundled feeds. From them, one can obtain a feel for RAIn's performance. From the total number of feeds, one can estimate that every feed in the database was visited approximately 3 times per day. The high numbers of updated feeds, coupled with RAIn's constant activity, hint that RAIn was not able to catch updates as they happened, instead catching them hours after they occurred. The high numbers of unchanged feeds hint that the cap of 24 hours for the fetch interval may need to be increased. These numbers also highlight the glitch that resulted in the crawler being killed by the kernel, as seen by a marked difference in performance beginning on March 21st.

Figure 5: The number of updates per day per bundle


Figure 5 shows the average number of updates per day per bundle, normalized against the number of feeds in the bundle.² One interesting result is that the Subscriptions A-list bundle has a much higher number of posts per day than the other bundles. Membership in the Subscriptions A-list bundle is based roughly on popularity, not number of updates. From this, it is possible to conclude that popular blogs are updated more frequently than other blogs. Another surprising result is that the number of posts per day for the Entertainment bundle is so low. In this case, it can be concluded that either entertainment blogs do not post all that frequently or the Entertainment bundle contained blogs that were not being updated during the analysis window. All of the numbers are slightly low due to the complications around the 24th, but they still demonstrate the relative frequencies between bundles.

Figure 6: The number of updates per day of week per bundle

Figure 6 shows the average number of updates per bundle against the day of the week, normalized against the number of feeds in the bundle.² Again, the numbers are slightly low due to the complications around the 24th. Unlike the previous graph, the relative frequencies are affected by those complications: the numbers for Saturday and Sunday are unaffected, the numbers for Monday and Friday only slightly affected and the numbers for Tuesday, Wednesday and Thursday significantly affected. Looking at the frequencies, it appears that there may be an interesting, if perhaps predictable, story about updates and day of week that should be reexamined.

² Normalization is performed by dividing the number of updates by the number of feeds in the bundle.

Figure 7: Date vs. time vs. frequency showing only the Sports bundle

Figure 8: Date vs. time vs. frequency showing only the Subscriptions A-list bundle


Figures 7 and 8 show the post density against date and time. The size of a bubble is representative of the number of posts in a ten-minute window. These graphs give an idea of the posting habits of a bundle. On the whole, posting is steady, but there are definite peaks and valleys for some bundles. For example, post density for the Subscriptions A-list bundle is higher during the period from 11 a.m. to 7 p.m. and lower during the period from 12 a.m. to 6 a.m. By contrast, the Sports bundle is very scattered, with very significant peaks throughout the day separated by large valleys. This window demonstrates fairly typical posting frequency. Deviations from the norm could be used to discover or pinpoint significant events. This information could be correlated with other information (e.g., popular words) to track the rise and fall of significant media events (e.g., the 2004 tsunami, the death of the pope). As with the previous graphs, these show some unusual behavior around the 24th, both sporadic behavior starting on the 21st and an increased density on the 25th.
Bundle                    Words per Body        Bundle                    URLs per Body
Computers & Technology    74.917                Computers & Technology    1.018
Entertainment             128.555               Entertainment             1.687
Eszter Politics           79.919                Eszter Politics           1.125
Politics                  95.482                Politics                  0.658
Sports                    67.006                Sports                    0.383
Subscriptions A-list      57.980                Subscriptions A-list      1.051

Tables 2 & 3: Words per body (left) and URLs per body (right)

Tables 2 and 3 report average statistics for each bundle in terms of the length of items and the number of URLs mentioned in the body. From these, it can be seen that the entertainment blogs observed are likely to be lengthy and contain links, while sports blogs are likely to contain short posts without links. However, it is important to note that some blogging packages limit the length of RSS feeds, which may have affected these numbers.
Bundle                    Words per Body    Factor
Computers & Technology    157.872           2.107
Entertainment             314.966           2.450
Eszter Politics           287.090           3.592
Politics                  316.392           3.314
Sports                    581.732           8.682
Subscriptions A-list      230.321           3.972

Table 4: Words per body containing URLs

Table 4 shows the average number of words per body when the item contains one or more URLs. The third column contains the factor between items containing URLs and items that do not. In all cases, the average number of words is significantly higher when an item contains a URL than when it does not, almost nine times higher in the case of sports blogs. By comparing this information to the information in Tables 2 and 3, we can conclude that a post that contains a URL is likely to contain multiple URLs. This also shows that URLs are uncommon in almost all cases, appearing in somewhere between 10% and 30% of items, on average.
Bundle                    Local        In-Bundle    Out-of-Bundle
Computers & Technology    1,690.536    1,177.572    7,131.892
Entertainment             752.025      678.750      8,569.225
Eszter Politics           687.449      1,259.742    8,052.809
Politics                  1,100.242    780.478      8,119.281
Sports                    638.540      564.424      8,797.035
Subscriptions A-list      523.641      691.602      8,784.757

Table 5: Number of URLs by type

Table 5 is an analysis of the webfeed_item_urls table, examining what people link to. Local links are either absolute links, pointing to the blog's host (in the case of known blog hosts) or the blog's domain (in the case of non-blog hosts), or relative links (i.e., links that do not contain http://). Blog hosts for this experiment were typepad.com, blogs.com, blogspot.com and blogdrive.com. In a small number of cases, these links are also non-http links (e.g., mailto). In-bundle links are http links pointing to other blog hosts contained in the bundle. Out-of-bundle links are all other http links. To facilitate comparison, all numbers have been normalized per 10,000 URLs.³ These numbers show a large degree of similarity across bundles, though there is a significant difference between Computers & Technology and the rest. From these numbers we can conclude that there is a difference in the behavior of technology blogs and others in terms of linking behavior. Also interesting is the significant difference in the ratio between local links and in-bundle links in Eszter Politics and the rest of the bundles. This tells us that the Eszter Politics bundle is a tightly connected bundle, collecting blogs that relate to each other.

³ Normalization is performed by taking the number and dividing by the total number of occurrences, then multiplying by 10,000 to find the frequency per 10,000 items; e.g., number of times said appears / total number of words * 10,000.

Table 6 shows a small sampling of the words used by posts in the Politics bundle over a four-day period, including their relative frequency per 10,000 words³ and the change from the previous day. When viewed over a large window, it is possible to do a postmortem of certain events, tracking when they started to gain in popularity and when they faded into obscurity. The data could also be combined with Kleinberg's techniques for identifying bursts of words [9] to identify popular events as they are happening. It is important to note that while the data was filtered against a list of common stop words, some constant terms (e.g., said, has) remain. For the purposes of searching, some common terms should be included that are not useful for analysis. When looking for bursts, or when identifying lists of popular words, a two-pass approach would be best, first building the complete list from the database, then recomputing the list filtering out common words. One approach for filtering would be to track the change per word over time and to exclude words that have a small average change over a large window (e.g., exclude words with an average change of 2 or less over the past two months).
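A sketch of this two-pass filtering over per-day word lists; the data structures are illustrative, and frequencies are assumed to be normalized per 10,000 words as in the footnote above.

    def stable_words(daily_ranks, min_days, max_change=2):
        """First pass: words whose average day-to-day rank change is small."""
        stable = []
        for word, ranks in daily_ranks.items():
            if len(ranks) < max(2, min_days):
                continue
            changes = [abs(ranks[i] - ranks[i - 1]) for i in range(1, len(ranks))]
            if sum(changes) / float(len(changes)) <= max_change:
                stable.append(word)
        return stable

    def popular_words(day_counts, exclude, top=20):
        """Second pass: recompute the day's top list with the stable words removed."""
        total = float(sum(day_counts.values()))
        scored = [(count / total * 10000, word)
                  for word, count in day_counts.items() if word not in exclude]
        scored.sort()
        scored.reverse()
        return [(word, freq) for freq, word in scored[:top]]

    # daily_ranks maps word -> list of daily ranks over a long window;
    # day_counts maps word -> raw count for a single day.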
            15-May                   16-May                   17-May                   18-May
Rank  Word     Freq    +/-   Word     Freq    +/-   Word     Freq    +/-   Word     Freq    +/-
1     said     68.936  New   said     56.943  0     has      59.085  1     has      59.378  0
2     has      63.208  New   has      53.093  0     said     59.000  -1    said     50.393  0
3     about    43.713  New   about    46.736  0     who      43.402  1     about    41.897  1
4     who      39.694  New   who      45.393  0     about    42.555  -1    who      41.213  -1
5     were     38.488  New   will     36.440  1     will     39.079  0     will     36.330  0
6     will     35.875  New   been     33.754  6     would    37.553  7     one      35.939  2
7     would    33.363  New   were     33.038  -2    out      31.280  5     would    33.791  -1
8     all      31.956  New   all      32.769  0     one      30.263  3     been     32.619  2
9     one      30.348  New   more     32.142  2     people   30.093  1     more     31.252  4
10    people   30.147  New   people   31.068  0     been     30.093  -4    all      30.470  2
11    more     29.645  New   one      30.531  -2    were     29.754  -4    out      30.275  -4
12    been     29.444  New   out      28.740  1     all      29.415  -4    were     28.126  -1
13    out      27.333  New   would    28.382  -6    more     28.737  -4    if       27.247  4
14    no       26.730  New   up       27.487  3     up       25.431  0     people   25.197  -5
15    what     25.123  New   Bush     27.218  11    Bush     25.346  0     up       25.099  -1
16    so       23.816  New   what     25.069  -1    like     23.905  12    can      23.634  14
17    up       23.414  New   some     24.263  4     if       23.397  2     our      23.341  26
18    if       22.309  New   Iraq     23.099  33    what     22.464  -2    what     22.071  0
19    like     22.108  New   if       23.010  -1    some     22.464  -2    some     21.583  0
20    can      21.806  New   can      22.920  0     into     22.464  7     other    21.583  7

Table 6: Most popular words per day for the Politics bundle

Table 7 shows statistics for the word Schiavo, including its rank among all words in the Politics bundle for each day, its relative frequency per 10,000 words and the change from the previous day. On the 15th, Schiavo was not ranked among the 1,000 most popular words. As the Schiavo case gained in national attention, the frequency exploded, peaking as the most popular word on the 22nd, as the courts and Congress debated her case. Her name remained one of the most popular words in the Politics bundle up until her death (and likely beyond). This pattern demonstrates both how a word will jump to the top of the list as a story breaks and how significant events can be identified and their rise and fall tracked by looking back through the word lists.


Date      Rank   Freq     +/-
15-Mar    NA     NA       NA
16-Mar    784    2.211    New
17-Mar    524    2.967    260
18-Mar    43     15.430   481
19-Mar    28     20.438   15
20-Mar    21     21.504   7
21-Mar    23     23.101   -2
22-Mar    1      83.628   22
25-Mar    6      38.832   -5*
26-Mar    5      38.011   1
27-Mar    9      30.973   -4
28-Mar    17     24.174   -8
29-Mar    16     25.095   1
30-Mar    24     20.818   -8
31-Mar    14     26.166   10

* difference between the 22nd and the 25th

Table 7: Rank and frequency for the word Schiavo in the Politics bundle

Conclusions
With RAIn, we have created a framework for monitoring and analyzing RSS feeds. It is fairly lightweight, requiring only inexpensive hardware. The design is modular, allowing for the easy replacement of components to either support different functionality or improve performance on a given system. RAIn is a complete system, including feed discovery, retrieval, archiving, indexing and a querying interface. It can be pointed at any site with an RSS feed, archiving the site's RSS and providing the ability to search for items based on keywords. More complex queries can be performed to generate statistical information, either about a site or a group of sites.

We also described some of the statistical analyses possible using RAIn. These range from simple metrics such as update frequency to complex analyses of the content of the items contained in feeds. For analysis, several bundles were defined. Each bundle contains a handpicked set of RSS feeds representing a particular blogging community. We were able to generate several different sets of statistical information about the blogs contained in the bundles as well as the other 77,373 blogs in RAIn's database.

Based on our experiences with RAIn, the system proved to be very capable. Inexpensive hardware supported processing more than 200,000 feeds per day. More expensive hardware, or a cluster of inexpensive hardware, should be capable of processing a significantly larger number of feeds. Despite claims that there may be more than 50 million blogs worldwide, it is likely that there are significantly fewer that are actively updated. A larger RAIn installation may be able to compete with sites like Technorati and Syndic8, monitoring a significant percentage of the active blogs in the world.

Building upon the statistics, more complex analyses are possible. RAIn could easily be used as the basis of a word monitoring system. Simple word-burst techniques could be applied to watch for sudden changes in a word's frequency, finding significant events as they are happening. In times of crisis, the Internet has proven to be the fastest source of news and information time and again. By actively monitoring RSS feeds, it may be possible to become aware of significant events before they hit the mainstream media.

RAIn also accumulates a substantial amount of content from blogs. On top of the statistical indexing methods currently in use, full-text indexing methods could be applied to create a blog search engine. Combined with existing search technologies like Lucene, Nutch or XTF, the content could be easily indexed and searched, providing a very substantial searchable archive of blogs.

RAIn proved to be very capable, monitoring a significant number of feeds on inexpensive hardware. More important, RAIn proved to be very flexible and adaptable. As configured, a large number of statistical analyses can be performed on RAIn's data. However, with RAIn's raw data, any statistical analysis possible can be performed; one need only write the module to do it.


References
[1] Apache Software Foundation, The. Apache Lucene. May 10, 2005. <http://lucene.apache.org/>

[2] Barr, Jeff and Bill Kearney. Syndic8. May 10, 2005. <http://www.syndic8.com/>

[3] Lemburg, Marc-André. mxDateTime - Date and Time types for Python. May 10, 2005. <http://www.egenix.com/files/python/mxDateTime.html>

[4] Goodnough, Abby and Maria Newman. Supreme Court Rejects Request to Reinsert Feeding Tube. New York Times, March 24, 2005. May 10, 2005. <http://www.nytimes.com/2005/03/24/politics/24cnd-schia.html>

[5] Hastings, Kirk and Martin Haye. XTF (eXtensible Text Framework). May 10, 2005. <http://xtf.sourceforge.net/>

[6] Jacobsen, Kjetil and Markus Oberhumer. PycURL Home Page. April 6, 2005. May 10, 2005. <http://pycurl.sourceforge.net/>

[7] Khare, Rohit, Doug Cutting, Kragen Sitaker and Adam Rifkin. Nutch: A Flexible and Scalable Open-Source Web Search Engine. CommerceNet Labs Technical Report #04-04. May 10, 2005. <http://labs.commerce.net/wiki/images/0/06/CN-TR-04-04.pdf>

[8] Klam, Matthew. Fear and Laptops on the Campaign Trail. New York Times, September 26, 2004. May 10, 2005. <http://www.nytimes.com/2004/09/26/magazine/26BLOGS.html?ex=1253851200&en=8b59680f1bd93479&ei=5090>

[9] Kleinberg, Jon. Bursty and Hierarchical Structure in Streams. Proceedings of the 8th SIGKDD, July 2002.

[10] Libby, Dan. RSS 0.91 Spec, revision 3. July 10, 1999. May 10, 2005. <http://my.netscape.com/publish/formats/rss-spec-0.91.html>

[11] Mueller, Martin. The WordHoard Project. April 2005. May 10, 2005. <http://bistro.northwestern.edu/AnaServer?WordHoardGuide+0+frame.anv>

[12] Pilgrim, Mark. Universal Feed Parser. May 10, 2005. <http://www.feedparser.org/>

[13] PostgreSQL Global Development Group. PostgreSQL: The world's most advanced open source database. May 10, 2005. <http://www.postgresql.org/>


[14] Riley, Duncan. Number of blogs now exceeds 50 million worldwide. The Blog Herald, April 14, 2005. May 10, 2005. <http://www.blogherald.com/2005/04/14/number-of-blogs-now-exceeds-50-million-worldwide/>

[15] Sifry, David et al. Technorati. May 10, 2005. <http://www.technorati.com/>

[16] van Rossum, Guido et al. Python Programming Language. May 10, 2005. <http://www.python.org/>

[17] Winer, Dave. RSS 2.0 Specification. January 30, 2005. May 10, 2005. <http://blogs.law.harvard.edu/tech/rss>

[18] Winstead Jr., Jim. blo.gs. May 10, 2005. <http://blo.gs/>

