The World Wide Web conjures up images of a giant spider's web in which everything is connected to everything else in a random pattern, and you can get from one edge of the web to another simply by following the right links. Theoretically, that is what makes the web different from a typical index system: you can follow hyperlinks from one page to another. In the "small world" theory of the web, each web page is thought to be separated from every other web page by an average of about 19 clicks. In 1968, sociologist Stanley Milgram popularized the small-world theory for social networks by showing that every human being was separated from every other human being by only about six degrees of separation. On the web, the small-world theory was supported by early research on a small sample of websites. But research conducted jointly by scientists from IBM, Compaq, and AltaVista found something quite different. These scientists used a web crawler to identify 200 million web pages and follow 1.5 billion links on those pages.
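To make the idea of "clicks" of separation concrete, here is a minimal sketch (not taken from the study itself) that computes the average number of clicks between reachable page pairs using breadth-first search. The page names and the link graph are invented purely for illustration.

```python
from collections import deque

def avg_degrees_of_separation(links):
    """Average shortest-path length (in clicks) over all reachable ordered page pairs.

    `links` maps each page to the pages it links to (a toy, hypothetical link graph).
    """
    total, pairs = 0, 0
    for start in links:
        # Breadth-first search gives the minimum number of clicks from `start`.
        dist = {start: 0}
        queue = deque([start])
        while queue:
            page = queue.popleft()
            for nxt in links.get(page, []):
                if nxt not in dist:
                    dist[nxt] = dist[page] + 1
                    queue.append(nxt)
        for other, d in dist.items():
            if other != start:
                total += d
                pairs += 1
    return total / pairs if pairs else float("inf")

# Tiny illustrative web: A -> B -> C -> A, plus C -> D.
toy_web = {"A": ["B"], "B": ["C"], "C": ["A", "D"], "D": []}
print(avg_degrees_of_separation(toy_web))  # average clicks between reachable pairs
```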

The researchers discovered that the web was not at all like a spider's web, but rather like a bow tie. The bow-tie web had a "strongly connected component" (SCC) made up of some 56 million web pages. On the right-hand side of the bow tie was a set of 44 million OUT pages that could be reached from the center, but from which there was no way back to the center. OUT pages tended to be corporate intranet and other website pages that are designed to trap you at the site once you land. On the left side of the bow tie was a set of 44 million IN pages from which you could get to the center, but which could not be reached from the center. These were typically newly created pages that had not yet been linked to by many center pages. Another 43 million pages were classified as "tendril" pages that did not link to the center and could not be reached from it, although tendril pages were sometimes linked to IN and/or OUT pages. Occasionally, tendrils joined each other without passing through the center (these are called "tubes"). Finally, there were 16 million pages totally disconnected from everything.
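In graph terms, the SCC is the set of pages that can all reach one another, OUT is what the core can reach but not return from, and IN is what can reach the core but cannot be reached from it. The sketch below shows one way to split pages into those regions; it is a simplification that assumes a single seed page already known to lie in the core and uses a tiny, hypothetical link graph.

```python
from collections import deque

def reachable(graph, start):
    """Set of pages reachable from `start` by following directed links (BFS)."""
    seen, queue = {start}, deque([start])
    while queue:
        page = queue.popleft()
        for nxt in graph.get(page, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

def bow_tie(links, seed):
    """Split pages into bow-tie regions, given a `seed` page assumed to lie in the core."""
    reverse = {}
    for page, targets in links.items():
        reverse.setdefault(page, [])
        for t in targets:
            reverse.setdefault(t, []).append(page)

    forward = reachable(links, seed)      # pages the seed can reach
    backward = reachable(reverse, seed)   # pages that can reach the seed
    scc = forward & backward              # mutually reachable core (the SCC)
    out = forward - scc                   # reachable from the core, no way back
    in_ = backward - scc                  # reach the core, unreachable from it
    other = (set(links) | set(reverse)) - scc - out - in_   # tendrils, tubes, disconnected
    return {"SCC": scc, "IN": in_, "OUT": out, "OTHER": other}

# Hypothetical mini-web: A and B form the core, C links in, D is linked out, E is isolated.
toy_web = {"A": ["B", "D"], "B": ["A"], "C": ["A"], "D": [], "E": []}
print(bow_tie(toy_web, seed="A"))
```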

Research by Albert-László Barabási at the University of Notre Dame provides further evidence for the non-random and structured nature of the web. Barabási's team found that, far from being an exponentially exploding random network of 50 billion web pages, activity on the web was actually highly concentrated in "highly connected supernodes" that provided connectivity to less connected nodes. Barabási called this type of network a "scale-free" network and found parallels in the growth of cancers, disease transmission, and computer viruses. It turns out that scale-free networks are highly vulnerable to destruction: if their supernodes are destroyed, the transmission of messages is quickly interrupted. On the plus side, if you are a seller trying to "spread the word" about your products, place your products on one of the supernodes and watch the news spread. Or create supernodes and attract a huge audience.
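A rough sense of how supernodes emerge comes from a "rich get richer" growth rule of the kind Barabási described, in which each new page prefers to link to pages that already have many links. The following is a minimal, illustrative sketch with arbitrary parameters and page counts, not a reproduction of Barabási's model or data.

```python
import random

def preferential_attachment(n_pages, links_per_page=1, seed=42):
    """Grow a toy link graph where new pages tend to link to already well-linked pages."""
    random.seed(seed)
    degree = {0: 1, 1: 1}          # start with a single link between pages 0 and 1
    attach = [0, 1]                # one entry per link endpoint; sampling from this
                                   # list picks pages with probability proportional
                                   # to their degree ("rich get richer")
    for new_page in range(2, n_pages):
        degree[new_page] = 0
        for _ in range(links_per_page):
            chosen = random.choice(attach)
            degree[chosen] += 1
            degree[new_page] += 1
            attach.extend([chosen, new_page])
    return degree

degrees = preferential_attachment(10_000)
print(sorted(degrees.values(), reverse=True)[:5])   # a handful of supernodes dominate
print(sum(d <= 2 for d in degrees.values()))        # most pages stay poorly connected
```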

Thus, the picture of the web that emerges from this research is quite different from earlier reports. The notion that most pairs of web pages are separated by a handful of links, almost always fewer than 20, and that the number of connections would grow exponentially with the size of the web, is not supported. In fact, there is a 75% chance that there is no path from one randomly chosen page to another. With this knowledge, it becomes clear why the most advanced web search engines index only a very small percentage of all web pages, and only about 2% of the overall population of Internet servers (about 400 million). Search engines cannot find most websites because their pages are not well connected or linked to the central core of the web. Another important finding is the identification of a "deep web" composed of more than 900 billion web pages that are not easily accessible to the web crawlers used by most search engine companies. Instead, these pages are either proprietary (not available to crawlers and non-subscribers, such as the pages of the Wall Street Journal) or are not easily reachable from home pages. In recent years, newer search engines (such as the medical search engine MammaHealth) and older ones such as Yahoo have been revised to search the deep web. Because e-commerce revenue depends in part on customers being able to find a website using search engines, website managers must take steps to ensure that their web pages are part of the connected central core, or "supernodes," of the web. One way to do this is to make sure the site has as many links as possible to and from other relevant sites, especially other sites within the SCC.
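For illustration only, a figure like the 75% chance of no path between two random pages can be estimated on any link graph by sampling page pairs and testing directed reachability. The sketch below does this on a tiny, invented bow-tie-shaped graph, not on real web data.

```python
import random
from collections import deque

def reachable(links, start):
    """Pages reachable from `start` by following directed links (BFS)."""
    seen, queue = {start}, deque([start])
    while queue:
        page = queue.popleft()
        for nxt in links.get(page, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

def no_path_fraction(links, samples=1_000, seed=0):
    """Estimate the probability that no directed path joins two randomly chosen pages."""
    random.seed(seed)
    pages = list(links)
    misses = 0
    for _ in range(samples):
        a, b = random.sample(pages, 2)
        if b not in reachable(links, a):
            misses += 1
    return misses / samples

# Hypothetical mini-web shaped like the bow tie: a core (A, B), an IN page, an OUT page,
# and two disconnected pages.
toy_web = {"IN": ["A"], "A": ["B"], "B": ["A", "OUT"], "OUT": [],
           "X": ["Y"], "Y": []}
print(no_path_fraction(toy_web))  # well above 0.5 even in this tiny example
```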