Information That Cannot Be Found in Google
Not all information can be found using Google, and not even all the information
available on the Web can be accessed with Google.
From Chapter 10, here's some information about what can and can't be found
using Google, and some places to go for information not in Google.
Not in Google
There are large parts of the World Wide Web that no search engine, Google included, can “see.” The invisible Web (also sometimes called the deep Web or dark matter) is hard to define precisely. The best way to think of it is simply as material on the Web that has been excluded from search engines, and specifically from Google, either on purpose or because of technological limitations.
Material on the Web that is invisible to Google will almost certainly be invisible to other search engines as well.
It’s easy to see why some Web sites, such as those with adult content, might be excluded from Web search engines and thus rendered “invisible.” It may be a little harder to understand why sites that contain information of value to researchers are also invisible.
There are a number of possible reasons that Web pages might be excluded from Google (and other search engines). These include:
- Dynamic results aren’t easy to read: If a page is dynamically generated and assembled from a database, Google might not return it as a result to your query. Spiders can access some dynamically generated pages, particularly when a page is pulled intact out of a database (even the results page from a Google search can be considered a dynamically generated page). They can have trouble, though, with any dynamic generation that involves setting multiple fields to return the results.
- Membership has its privileges: If you’re required to log in to access a page, or a subscription or fee is required to access it (see “Premium and specialized online services”), the page may not come up in your results.
- The page is flying solo: If the Web page is “disconnected,” with no other pages linking to it (so it is hard for spiders to find), it won’t show up high on your list of results.
- The page doesn’t have words: If a Web page contains mostly non-text matter, such as imagery, indexing may be limited to ancillary text such as that in the alt attributes of its img tags.
- Part of a site is available, but not the good stuff: Depending on the site and the information it contains, a Webmaster may mark specific pages as off-limits to crawlers.
- The information is in a format that can’t be easily read: If a file is in a format that is hard for the spider to read, such as an executable file or a compressed file (such as a .zip file), Google may not be able to find it. Google (unlike most other search engines) does have an impressive ability to index Acrobat (PDF) files, PostScript files, and Microsoft Word documents.
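One of the reasons above, a Webmaster marking pages as off-limits to crawlers, can be illustrated with a short sketch. The text does not say which mechanism a site uses; a common one is a robots.txt file, and the example below assumes that. The rules, the crawler name ExampleBot, and the example.com URLs are all hypothetical and exist only for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules: the site asks all crawlers to stay out
# of two directories and leaves everything else open.
ROBOTS_TXT = """\
User-agent: *
Disallow: /members/
Disallow: /reports/full/
"""


def is_crawlable(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given rules allow user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # Illustrative URLs only; nothing is fetched over the network.
    for url in (
        "https://www.example.com/about.html",
        "https://www.example.com/members/archive.html",
        "https://www.example.com/reports/full/2004.html",
    ):
        status = "crawlable" if is_crawlable(ROBOTS_TXT, "ExampleBot", url) else "off-limits"
        print(f"{url}: {status}")
```

A well-behaved spider runs a check of this kind before requesting a page, so pages under the disallowed paths stay out of the index even though anyone with the address can still open them in a browser.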
Finding Information Not in Google
For a researcher, the most important part of the invisible Web is made up of fee-based premium services that provide high-quality information. The information provided by these services may be stored in some kind of database, but to the researcher it hardly matters so long as the service makes a Web interface to the data available.
Some of the best-known online fee-based research services are:
- DataStar, a professional research service with an emphasis on companies and industry
- Dialog, an extensive research service that makes more than 600 research databases available either through dedicated software or the Dialog Web site
- Factiva, an extensive research database with a focus on business, finance, and current events
- LexisNexis, perhaps the best-known research service, featuring a wide variety of databases covering business, news, and legal affairs
- Questia, an extensive library of books and periodicals, primarily in the social sciences
- Westlaw, an online legal research service that provides access to statutes, case law, public records, and other legal content