Resource Discovery on the WWW
Information retrieval on the Web is like searching for a needle in a haystack: one needs the right tools to separate the needle from the hay. In this project we develop tools that help users identify and track the information they are interested in.
When a user wants to find information about a specific topic, he/she sends a query to a search engine (e.g. Alta-Vista), which replies with several URLs. Every time the user looks for new information about the same topic, Alta-Vista returns the same URLs, flooding the user with unnecessary information. USEwebNET is designed to relieve users from the long waits and the information flood associated with the traditional search model. Specifically, USEwebNET is a network tool with a user-friendly interface that retrieves documents about selected subjects (or updated versions of selected documents) from the Web and presents them to the user, along with various information about them, according to the user's preferences.
On a daily basis, USEwebNET contacts the search engines selected in the user's preferences (currently Yahoo, Alta-Vista and Hot-Bot are supported) and downloads all documents that match the specified keywords and have not been downloaded on previous days.
For each user, USEwebNET keeps a database of his/her preferences. These include the search engines from which documents are to be retrieved, the keywords to be used for a search, and the time period after which documents that have not been read by the user are considered NOT valid and are deleted. USEwebNET also keeps track of the documents that each user has read. Thus, every time a user accesses USEwebNET, he/she is presented only with new or updated pages.
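The bookkeeping described above can be illustrated with a minimal sketch. It is not the USEwebNET implementation: the class `UserStore`, its fields, and its method names are all hypothetical, and the search-engine queries themselves are left out. The sketch only shows how a per-user record might filter out previously downloaded URLs, track read documents, and delete unread documents once the user's time period has passed.

```python
from dataclasses import dataclass, field
from datetime import date, timedelta


@dataclass
class UserStore:
    """Hypothetical per-user record: preferences plus document state."""
    keywords: list
    engines: list
    expiry_days: int                             # unread documents older than this are deleted
    seen: set = field(default_factory=set)       # every URL ever downloaded for this user
    stored: dict = field(default_factory=dict)   # URL -> day it was downloaded
    read: set = field(default_factory=set)       # URLs the user has already read

    def add_results(self, urls, today):
        """Keep only URLs not downloaded on a previous day; return the new ones."""
        new = [u for u in dict.fromkeys(urls) if u not in self.seen]
        for u in new:
            self.seen.add(u)
            self.stored[u] = today
        return new

    def mark_read(self, url):
        self.read.add(url)

    def expire(self, today):
        """Delete unread documents whose time period has elapsed; return them."""
        cutoff = today - timedelta(days=self.expiry_days)
        stale = [u for u, day in self.stored.items()
                 if u not in self.read and day < cutoff]
        for u in stale:
            del self.stored[u]
        return stale

    def unread(self):
        """Documents to present the next time the user connects."""
        return [u for u in self.stored if u not in self.read]
```

Keeping `seen` separate from `stored` is a design assumption of this sketch: it prevents a document that expired unread from being downloaded again the next day.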
PaperFinder
Scientists always need to stay informed about developments in their field. In order to do so, they subscribe to scientific journals, participate in conferences and workshops, collaborate with colleagues, browse publications available at libraries, etc. To narrow down the information they receive, scientists closely follow papers that appear in only a small subset of journals or conferences. However, the number of scientific publications increases rapidly. An increasing number of people engage in intellectual activities, which implies that more papers are being produced and published than ever before. The invention and spread of the World-Wide Web made the process of paper publication significantly easier than before, and added a large repository of on-line (electronic) papers to our body of knowledge. This growing number of printed and electronic papers makes it increasingly difficult for a single person to keep up with all the relevant information that (s)he might be interested in. Simply put, there are too many sources of (potentially) useful information, many more than any single person has the time to track.
It would be very useful if there were a tool that could filter all the available sources of information and deliver only useful papers to interested scientists. In this project we propose to develop PaperFinder, a tool that continually searches digital libraries of scientific publications, filters only the relevant papers, and delivers them to interested scientists through a friendly user interface. PaperFinder works in two modes: the keyword-based mode and the resource-discovery mode.
In the first mode, interested users supply PaperFinder with a few keywords that describe their field of interest, like ``digital libraries'' or ``process scheduling''. Along with these keywords, users specify a number of on-line digital libraries that PaperFinder should search for papers. PaperFinder then queries each digital library for papers matching these keywords. All replies are merged and presented to the user via a USENET-based interface. Once the user views some papers, PaperFinder marks them as ``read'' and does not present them to the user the next time. Users may also choose to ``save'' or ``delete'' a paper. Thus, users are always presented only with ``new'' papers that they have not previously seen or processed.
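As a rough illustration of the merge-and-filter step in the keyword-based mode, the sketch below assumes that each digital library's reply is simply a list of paper titles. The names `Status`, `paper_key`, `merge_replies` and `papers_to_show` are hypothetical, and matching papers by normalized title is an assumption of the sketch, not something PaperFinder is stated to do.

```python
from enum import Enum


class Status(Enum):
    NEW = "new"
    READ = "read"
    SAVED = "saved"
    DELETED = "deleted"


def paper_key(title):
    """Normalize a title so the same paper from two libraries compares equal."""
    return " ".join(title.lower().split())


def merge_replies(replies):
    """Merge per-library result lists, keeping the first copy of each paper.

    `replies` maps a library name to the list of titles it returned.
    """
    merged = {}
    for library, titles in replies.items():
        for title in titles:
            merged.setdefault(paper_key(title), (title, library))
    return merged


def papers_to_show(merged, status):
    """Return titles the user has not yet read, saved, or deleted."""
    return [title for key, (title, _library) in merged.items()
            if status.get(key, Status.NEW) is Status.NEW]


# Usage: two libraries return overlapping results; one paper was already read.
replies = {
    "libA": ["Process Scheduling", "Digital Libraries"],
    "libB": ["process scheduling", "Caching"],
}
status = {paper_key("Caching"): Status.READ}
merged = merge_replies(replies)
print(papers_to_show(merged, status))
```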
In the resource-discovery mode, PaperFinder sets out to discover papers that may match a user's interest, but which do not necessarily match some predefined keywords. In this mode, users specify some ``seed papers'' (or ``seed authors''), and PaperFinder searches the digital libraries for papers similar to them. Defining the best similarity metrics is an open and interesting issue. We favor the use of simple metrics that can be easily calculated. For example, papers that have a similar set of references may be close to each other. As another example, papers that have an overlapping set of co-authors, or several common keywords in their title/abstract, may also be similar.
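One simple, easily calculated metric of this kind is the Jaccard overlap between two sets. The sketch below is only an illustration under stated assumptions: a paper is represented as a dictionary with hypothetical fields `refs`, `authors` and `terms`, and the three overlaps are combined with arbitrary example weights. Neither the representation nor the weights come from PaperFinder itself.

```python
def jaccard(a, b):
    """Size of the intersection divided by size of the union (0.0 if both empty)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def similarity(p, q, w_refs=0.5, w_authors=0.3, w_terms=0.2):
    """Weighted overlap of references, co-authors, and title/abstract keywords.

    The weights are illustrative assumptions, not tuned values.
    """
    return (w_refs * jaccard(p["refs"], q["refs"])
            + w_authors * jaccard(p["authors"], q["authors"])
            + w_terms * jaccard(p["terms"], q["terms"]))


def rank(seed, candidates):
    """Order candidate papers from most to least similar to the seed paper."""
    return sorted(candidates, key=lambda c: similarity(seed, c), reverse=True)
```

With this definition, two papers citing `{1, 2, 3}` and `{2, 3, 4}` share two of four distinct references, so their reference overlap is 0.5.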
Availability: The code can be downloaded from here.