Automatic Retrieval and Normalization of Text data from the Internet

Description

In this project, the student is required to design and write a program that can automatically search, retrieve, and normalize text data from the Internet according to users' queries. The program needs to process each user's text query, submit the processed query to Internet search sites, and automatically retrieve and save the returned search results. Finally, the program needs to normalize and format the returned text and save it for future use. Since the program will eventually be used to automatically process a large number of user queries, efficiency is a key issue in its design and development.

Description retrieved from 2008 list of available NSERC projects

Information

Details

Note: the following gives only a very condensed summary of the project. The full report is available on request.

The program is split into three major components.

The first component retrieves web pages from the Internet based on an input query file. This file contains the queries to send to a search engine, one per line. To improve the search results, stop words (such as "the", "an", etc.) are removed from each query. For each query, the program asks the search engine to return 30 to 50 results (this number is configurable) and downloads each web page, as sketched below.
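As a rough illustration of the query-preprocessing step, the sketch below strips stop words from a query line and builds a search request. The stop-word list, endpoint URL, and parameter names are placeholders rather than the project's actual configuration, and the Charset overload of URLEncoder.encode requires Java 10 or later.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.StringJoiner;

/** Illustrative query preprocessing: stop-word removal plus request construction. */
public class QueryPreprocessor {

    // Tiny illustrative stop-word list; a real list would be much larger.
    private static final Set<String> STOP_WORDS = new HashSet<>(
            Arrays.asList("the", "an", "a", "of", "to", "in", "and"));

    /** Remove stop words from a raw query line. */
    public static String stripStopWords(String query) {
        StringJoiner cleaned = new StringJoiner(" ");
        for (String token : query.toLowerCase().split("\\s+")) {
            if (!STOP_WORDS.contains(token)) {
                cleaned.add(token);
            }
        }
        return cleaned.toString();
    }

    /** Build a hypothetical search URL asking for numResults hits. */
    public static String buildSearchUrl(String query, int numResults) {
        String encoded = URLEncoder.encode(stripStopWords(query), StandardCharsets.UTF_8);
        // "q" and "num" are placeholder parameter names; the real request
        // format depends on the search engine being queried.
        return "https://search.example.com/search?q=" + encoded + "&num=" + numResults;
    }

    public static void main(String[] args) {
        System.out.println(buildSearchUrl("the history of an internet search engine", 30));
        // -> https://search.example.com/search?q=history+internet+search+engine&num=30
    }
}
```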

Control then passes to the second component. Because web pages are not always well-formed enough for a DOM tree to be built directly, each page is first cleaned up by a third-party Java library (in this case, I chose HtmlCleaner). Once the page is represented as a DOM tree, the program runs it through a set of filters that remove unnecessary clutter (image tags, navigation areas, etc.). This is similar to how the CRUNCH proxy works.
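A minimal sketch of this clean-and-filter step is shown below. It assumes HtmlCleaner's clean, getElementsByName, removeFromTree, and getText calls, whose exact names may differ between library versions, and the tag list is illustrative rather than the project's real filter set.

```java
import org.htmlcleaner.HtmlCleaner;
import org.htmlcleaner.TagNode;

/** Illustrative cleanup and clutter filtering built on HtmlCleaner. */
public class PageFilter {

    // Tags that typically carry clutter rather than content (placeholder set).
    private static final String[] CLUTTER_TAGS = {"img", "script", "style", "nav", "iframe"};

    /** Repair malformed HTML into a DOM tree, strip clutter, and return plain text. */
    public static String extractText(String rawHtml) {
        HtmlCleaner cleaner = new HtmlCleaner();
        TagNode root = cleaner.clean(rawHtml);       // builds a tree even from messy markup

        for (String tag : CLUTTER_TAGS) {
            for (TagNode node : root.getElementsByName(tag, true)) {
                node.removeFromTree();               // drop the node and its subtree
            }
        }
        return root.getText().toString().trim();     // remaining text content
    }

    public static void main(String[] args) {
        String html = "<html><body><nav>menu</nav><p>Actual article text.</p>"
                    + "<img src='logo.png'></body></html>";
        System.out.println(extractText(html));       // -> Actual article text.
    }
}
```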

At this point, each web page has been reduced to a chunk of plain text. The last component handles text normalization, which at its most basic involves sentence splitting and handling non-standard words (NSWs) such as abbreviations and numbers. Research papers in this area show that the text normalization problem commonly arises in text-to-speech synthesis; as such, the project leverages the work of the MARY text-to-speech system.
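To make the two basic operations concrete, the toy sketch below splits text into sentences and expands a few non-standard words. It does not reflect MARY's actual pipeline; the abbreviation table and number expansion are placeholder examples only.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Toy normalizer: sentence splitting plus simple NSW expansion. */
public class TextNormalizer {

    // Placeholder abbreviation table; a real normalizer would use a large lexicon.
    private static final Map<String, String> ABBREVIATIONS = new HashMap<>();
    static {
        ABBREVIATIONS.put("approx.", "approximately");
        ABBREVIATIONS.put("etc.", "et cetera");
    }

    /** Split a chunk of plain text into sentences using the locale's rules. */
    public static List<String> splitSentences(String text) {
        List<String> sentences = new ArrayList<>();
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.ENGLISH);
        it.setText(text);
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            sentences.add(text.substring(start, end).trim());
        }
        return sentences;
    }

    /** Expand a few non-standard words: known abbreviations and small integers. */
    public static String expandNsw(String sentence) {
        StringBuilder out = new StringBuilder();
        for (String token : sentence.split("\\s+")) {
            if (ABBREVIATIONS.containsKey(token)) {
                out.append(ABBREVIATIONS.get(token));
            } else if (token.matches("\\d+")) {
                out.append(spellOutNumber(Integer.parseInt(token)));  // e.g. "30" -> "thirty"
            } else {
                out.append(token);
            }
            out.append(' ');
        }
        return out.toString().trim();
    }

    // Minimal spelled-out form for 0-99; real normalizers cover far more cases.
    private static String spellOutNumber(int n) {
        String[] ones = {"zero", "one", "two", "three", "four", "five", "six", "seven",
                "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
                "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"};
        String[] tens = {"", "", "twenty", "thirty", "forty", "fifty", "sixty",
                "seventy", "eighty", "ninety"};
        if (n < 20) return ones[n];
        if (n < 100) return tens[n / 10] + (n % 10 == 0 ? "" : " " + ones[n % 10]);
        return Integer.toString(n);  // fall back for anything larger
    }

    public static void main(String[] args) {
        String text = "The program saved 30 pages for the query. Each page was reduced to plain text.";
        for (String sentence : splitSentences(text)) {
            System.out.println(expandNsw(sentence));
        }
        System.out.println(expandNsw("approx. 20 results per query"));
        // -> approximately twenty results per query
    }
}
```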