Overview of the Internet and the Services Available
The Internet connects many different types of computers providing a variety of services to other computers. Those providing services are generally referred to as servers, while those requesting services are generally referred to as clients. Examples of the services provided on the Internet are web services provided through the Hypertext Transfer Protocol (HTTP), email provided through the Post Office Protocol (POP), Gopher, and Wide Area Information Servers (WAIS).
Any of these services may be used to provide markup language text to a client. The term “markup language” is used to refer to any type of formatted content, such as content using tags for formatting and/or organization. “Markup language text” refers to any content formatted in a particular markup language. One example of a markup language that is widely available on the Internet is the Hypertext Markup Language (HTML). Servers that provide HTML are generally called web servers, and the collections of HTML they provide are called websites. However, collections of content in other markup languages, such as Wireless Markup Language (WML), Extensible Markup Language (XML) or Mathematical Markup Language (MathML), are sometimes also referred to as websites. The types of content described above, HTML, WML, XML and MathML, are only examples of the different types of markup languages available. Many other types also exist, and new types continue to be developed for new applications and new devices.
Users are spending more time on the Internet performing more and more activities, from online shopping to banking; meanwhile, Internet sites are getting more complex in design and content. For example, one common way of performing activities on the Internet is through webpages, which are HTML pages provided by a server. A website is simply a collection of webpages, and the term website can also be used to refer to a collection of WML, XML or any other type of markup language text provided by a server.
Problems Associated with Current Websites
Websites are becoming more cluttered with guides and menus intended to improve the user's efficiency, but these guides and menus often end up distracting from the actual content of interest. These “features” may include script- and Flash-driven animation, menus, pop-up ads, obtrusive banner advertisements, unnecessary images, or links scattered around the screen.
These features have caused the gap in web usability between persons with disabilities and persons without disabilities to grow ever wider. Many of these technologies, including script- and Flash-driven animation, pop-ups, banners, and of course images, were designed to improve the web experience for sighted users. While some users may find these features effective, they may make websites less accessible to users with disabilities. The World Wide Web Consortium (W3C), through its Web Accessibility Initiative (WAI), has published a set of guidelines to assist web developers in creating sites that are accessible to all.
As an example, FIG. 5 shows a typical sports webpage from CNN Sports Illustrated. It contains not only the article 5020 (the text on the left of the screen), but also a number of clutter elements, such as the advertisement 5040 on the right, the horizontal banner ad 5010 immediately under the logo, and the advertisement links 5030 below the image that is related to the article. There are several corporate logos identifying the site, as well as ones for the webpage. There are also several elements intended to help with navigation of the site itself, and while there are no menu bars (vertical or horizontal) in this example, such menu bars are found on many webpages.
On websites such as the one shown in FIG. 5, screen readers, which visually impaired users rely on for speech rendering of webpages, often end up reading the raw HTML tags rather than the content between them. The problem worsens with handheld devices, where precious bandwidth and time may be wasted on downloading and then rendering clutter that the user is likely to scroll past without reading.
Cluttered websites are a serious issue because the number of visually impaired web users (and computer users in general) is expected to increase dramatically as the population continues to age. For example, it is estimated that the number of Americans over the age of 65 will double between 2000 and 2040. In 1997, the United States Census Bureau estimated that there were 7.7 million adults with “non-severe visual limitation,” which was defined as “difficulty with seeing words and letters, even with eyeglasses,” and 1.8 million American adults with “severe visual limitation,” which was defined as the “inability to see words and letters, even with eyeglasses.” Persons with even minimal visual impairment are likely to encounter problems in everyday life. For example, people with vision worse than 20/40 cannot obtain an unrestricted driver's license in most states and may require assistive devices such as magnifiers for reading.
Overview of Content Extraction
One solution to this problem of cluttered websites that are inaccessible to disabled people is content extraction and content reformatting. A common reformatting practice for improving webpage accessibility for the visually impaired is to increase font size and decrease screen resolution; however, this also enlarges the clutter, so less of the actual content fits on the screen and efficiency is reduced.
Another solution for making websites more accessible is screen readers for the blind. Screen readers convert the visual content of a webpage into audible content so that a user can hear it. However, these screen readers generally do not remove clutter from websites and often read out raw markup language text. Content extraction allows screen readers to process only the extracted content, rather than processing cluttered data from the web or relying on specialized extractors written for each web domain.
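By way of illustration only, the difference between raw markup language text and the text a screen reader would ideally speak can be sketched in Python using the standard library html.parser module. The names TextOnly and speakable_text are hypothetical and chosen for this sketch; it merely separates text from tags and makes no attempt to decide which text is clutter, so it should not be read as the content extraction technique itself.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collects the text between tags, ignoring script and style bodies."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []
        self.hidden = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.hidden += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.hidden:
            self.hidden -= 1

    def handle_data(self, data):
        # Keep only visible, non-blank text and normalize whitespace.
        if not self.hidden and data.strip():
            self.chunks.append(" ".join(data.split()))


def speakable_text(markup):
    """Return the text content of `markup` as a single string."""
    parser = TextOnly()
    parser.feed(markup)
    parser.close()
    return " ".join(parser.chunks)


raw = ('<div class="story"><h1>Final score</h1><script>track()</script>'
       '<p>The home team won <b>3-2</b> tonight</p></div>')
spoken = speakable_text(raw)
```

Reading raw aloud would include the tag names and attributes, whereas spoken holds only the words between the tags. Any advertisement or menu text on the page would still be present in spoken, which is the gap that content extraction is meant to address.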
The automatic extraction of useful and relevant content from webpages has many other applications in addition to assisting visually disabled users. These applications include enabling end users to access the web more easily over constrained devices like PDAs and cellular phones, providing less noisy data for information retrieval and summarization algorithms, and generally improving the web surfing experience.
Traditional approaches to removing clutter or making content more readable include removing images, disabling JavaScript, etc., all of which eliminate a webpage's original look-and-feel. Many of the products applying these approaches also rely on hardcoded techniques for certain common webpage designs as well as fixed “blacklists” of advertisers. These hardcoded techniques are inflexible and cannot easily be applied to websites they were not hardcoded for or to websites that have undergone structural changes.
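The traditional approach described above may be sketched, purely as an illustration, in Python using the standard library. The blacklist entry ads.example.com and the names AD_HOSTS, ClutterFilter and strip_clutter are assumptions made for this sketch and are not taken from any particular product.

```python
from html import escape
from html.parser import HTMLParser
from urllib.parse import urlparse

AD_HOSTS = {"ads.example.com"}  # hypothetical fixed advertiser blacklist


class ClutterFilter(HTMLParser):
    """Re-emits HTML while dropping images, scripts and blacklisted links."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skipping = None  # name of the element currently being dropped

    def _dropped(self, tag, attrs):
        if tag == "script":
            return True
        if tag == "a":
            href = dict(attrs).get("href") or ""
            return urlparse(href).hostname in AD_HOSTS
        return False

    def handle_starttag(self, tag, attrs):
        if self.skipping or tag == "img":
            return  # inside a dropped element, or an image
        if self._dropped(tag, attrs):
            self.skipping = tag
            return
        self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing form such as <br/> or <img/>.
        if not self.skipping and tag != "img" and not self._dropped(tag, attrs):
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self.skipping:
            if tag == self.skipping:
                self.skipping = None
            return
        if tag != "img":
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skipping:
            self.out.append(escape(data, quote=False))


def strip_clutter(html_text):
    parser = ClutterFilter()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.out)


page = ('<p>Score: 3 &amp; 2 <img src="logo.gif"></p>'
        '<script>var x = "<b>";</script>'
        '<a href="http://ads.example.com/buy">Buy <b>now</b></a>'
        '<a href="/story">Full story</a>')
result = strip_clutter(page)
```

The sketch shows both weaknesses noted above: every image is removed whether or not it belongs to the article, and an advertisement served from any host missing from AD_HOSTS passes through untouched until the list is edited by hand.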