How many Web sites are there?
- One million Web site names are in common use.
- There are about 450,000 unique host machines.
- If you request the top page from these 450,000, about 300,000 will return
one within a reasonable time. The rest appear to be sporadic or obsolete.
- About 95% of the 300,000 servers are "up" at any given time.
How big is the Web?
There are an estimated 80 million HTML pages on the public web as of January
1997. This figure is fuzzy because some sites are entirely dynamic (a database
generates pages in response to clicks or queries). The typical Web page has 15
links (HREFs) to other pages or objects and 5 source objects (SRC), such as
images or sounds.
- The typical HTML page is 5 KB.
- The typical image (GIF or JPEG) is 12 KB.
- The average object served via HTTP is 15 KB.
- The typical Web site is about 20% HTML text, 80%
images, sounds, and executables (by size in bytes).
The upshot of this data is that it takes about 400 GB (gigabytes) to store
the text of a snapshot of the public web and about 2000 GB (2 terabytes) to
store the non-text files.
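
The text figure follows directly from the numbers above: 80 million pages at
about 5 KB each. A minimal sketch of that arithmetic, assuming decimal units
(1 GB = 1,000,000 KB); the function and constant names are illustrative and
not from the original. Scaling the text size by the 20%/80% split gives
roughly 1,600 GB of non-text data, the same order of magnitude as the quoted
figure of about 2000 GB.

```python
# Back-of-the-envelope storage estimate from the figures quoted above.
# Assumption: decimal units (1 GB = 1,000,000 KB). All names are illustrative.

PAGES = 80_000_000      # estimated HTML pages on the public web, January 1997
HTML_PAGE_KB = 5        # typical HTML page size in KB
KB_PER_GB = 1_000_000   # decimal gigabyte


def text_storage_gb(pages: int = PAGES, page_kb: float = HTML_PAGE_KB) -> float:
    """Storage needed for the HTML text of one snapshot, in GB."""
    return pages * page_kb / KB_PER_GB


def non_text_storage_gb(text_gb: float, text_share: float = 0.20) -> float:
    """Non-text storage implied by the share of bytes that is HTML text."""
    return text_gb * (1 - text_share) / text_share


if __name__ == "__main__":
    text_gb = text_storage_gb()
    print(f"text:     {text_gb:.0f} GB")
    print(f"non-text: {non_text_storage_gb(text_gb):.0f} GB")
```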
How big are individual Web sites?
- The median size for a Web site is about 300 pages; only 50 sites have more
than 30,000 pages.
- About 5% of all servers have a robots.txt file (for governing how crawlers
visit).
- About 1% of all servers have a sitelist.txt file (to aid site mapping and
robot revisiting).
How fast is the Web growing?
- The size of the Web is doubling yearly, but this statistic is losing its
meaning because of the growth of dynamic sites.
- The typical Web page is only about 2 months old.
- Dynamic sites are becoming a significant presence; JavaScript is
widespread, Java much less so, but growing.
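
To show what "doubling yearly" would imply, here is a rough sketch that
projects the page count forward from the January 1997 estimate of 80 million
pages. It assumes the doubling rate holds constant, which the caveat about
dynamic sites suggests it may not; the function name is illustrative.

```python
# Projection under the assumption of constant yearly doubling.
# This is an illustration of the stated growth rate, not a forecast.

BASE_PAGES_1997 = 80_000_000  # estimated HTML pages, January 1997


def projected_pages(years_after_1997: float,
                    base_pages: int = BASE_PAGES_1997) -> float:
    """Page count after the given number of years of yearly doubling."""
    return base_pages * 2 ** years_after_1997


if __name__ == "__main__":
    for years in range(4):
        print(f"January {1997 + years}: "
              f"{projected_pages(years) / 1e6:.0f} million pages")
```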
How do surfers use the Web?
- The typical user downloads around 70 KB of data for each page visited.
- The typical user visits 20 web pages per day.
- One percent of all user requests result in "404, File Not Found"
responses.
- The 1000 most popular sites (out of 300,000) account for about half of all
traffic.
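
Two of the figures above combine into simple derived numbers: the data a
typical user downloads per day, and the fraction of sites that carries half
of all traffic. A small sketch of both calculations, using only the values
quoted in the list; the names are illustrative.

```python
# Derived usage figures from the statistics quoted above. Names are illustrative.

KB_PER_PAGE_VISIT = 70     # data downloaded per page visited
PAGES_PER_DAY = 20         # pages visited per user per day
TOP_SITES = 1_000          # most popular sites
RESPONDING_SITES = 300_000 # servers that return a top page


def daily_download_kb(kb_per_page: float = KB_PER_PAGE_VISIT,
                      pages_per_day: int = PAGES_PER_DAY) -> float:
    """Data a typical user downloads per day, in KB."""
    return kb_per_page * pages_per_day


def top_site_fraction(top: int = TOP_SITES,
                      total: int = RESPONDING_SITES) -> float:
    """Fraction of all sites represented by the most popular ones."""
    return top / total


if __name__ == "__main__":
    print(f"daily download per user: {daily_download_kb():.0f} KB")
    print(f"top sites as share of all sites: {top_site_fraction():.2%}")
```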
These 1997 statistics were compiled from various resources on the Web.