The early implementations of Wpoison raised a couple of very valid
safety concerns. These were expressed to the author by early users
of Wpoison, and they have now been largely eliminated.
Two problems, in particular, were obvious from the beginning.
The first problem was the harm that Wpoison might do to legitimate
web crawlers, such as those used by the major web search engine
companies, and the knock-on harm this might do to any site running
Wpoison which had high hopes of being properly (and prominently)
cataloged by the major search engine companies.
This problem was trivially eliminated by including code in Wpoison
which causes each Wpoison-generated randomized web page to carry a
clear indication (for the benefit of legitimate web crawlers) that
the page in question should not be cataloged in any way. Basically,
Wpoison now merely makes proper use of the (pre-existing)
Robot Exclusion Protocol.
Use of this protocol, and its associated ``off limits'' markers,
within all web pages generated by Wpoison serves to ensure both that
(a) legitimate[1] web crawlers will not get caught up in repeatedly
reading thousands (or millions) of randomized garbage pages generated
by Wpoison and that (b) the legitimate search engine companies will
still be able to successfully add your web site to their databases.
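
As a rough illustration of the idea above, the following sketch builds
one randomized page that carries an ``off limits'' marker. It is not
Wpoison's actual code. It assumes the marker is the conventional robots
META tag, and the names build_decoy_page, random_word and ROBOTS_META
are hypothetical, as is the use of the reserved .example domain for the
bogus addresses.

```python
import html
import random
import string

# Assumption: the per-page "off limits" marker is the robots META tag.
ROBOTS_META = '<meta name="robots" content="noindex,nofollow">'


def random_word(rng, min_len=3, max_len=9):
    """Return a random lowercase word (hypothetical helper)."""
    length = rng.randint(min_len, max_len)
    return "".join(rng.choice(string.ascii_lowercase) for _ in range(length))


def build_decoy_page(rng, n_addresses=5):
    """Return one randomized HTML page that carries the robots marker."""
    addresses = [
        f"{random_word(rng)}@{random_word(rng)}.example"
        for _ in range(n_addresses)
    ]
    links = "\n".join(
        f'<a href="mailto:{html.escape(a)}">{html.escape(a)}</a><br>'
        for a in addresses
    )
    return (
        "<html><head>\n"
        f"{ROBOTS_META}\n"
        f"<title>{random_word(rng)}</title>\n"
        "</head><body>\n"
        f"{links}\n"
        "</body></html>\n"
    )


if __name__ == "__main__":
    print(build_decoy_page(random.Random()))
```

A crawler that honors the marker will neither catalog such a page nor
follow its links, while a crawler that ignores the marker will find
only bogus addresses on it.
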
The second problem was the potentially bad effects that having a
locally installed copy of Wpoison might have on one's own CPU and
bandwidth usage. Given how Wpoison works, it is easy to see that
(unless something is done to prevent it) the spammers' address
harvesting web crawlers may get trapped by Wpoison (as intended) and
may then access your installed copy of Wpoison over and over again
(also as intended), perhaps even to such an extent that they end up
using most or all of your available CPU cycles and/or most or all of
your available network bandwidth.
This problem was also solved in a fairly trivial and straightforward
way. In a nutshell, just before it generates the very tail end of any
one of its randomly generated pseudo web pages, Wpoison pauses for
several seconds. It does nothing (other than waste time) during those
several seconds.
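
The pause just described can be sketched roughly as follows. This is
an illustration only, not Wpoison's actual code: the function names,
the write callback and the 5 second default are assumptions, since the
text says only that the pause lasts ``several seconds''.

```python
import time


def max_pages_per_hour(pause_seconds):
    """Upper bound on pages one client can fetch per hour when it
    requests pages one at a time and each page includes the pause."""
    return 3600 / pause_seconds


def send_page(head_and_body, tail, write, pause_seconds=5.0, sleep=time.sleep):
    """Send most of a page, idle for a while, then send the tail end.

    `write` is any callable that delivers a string to the client.
    `sleep` is injectable so the pause can be replaced in tests.
    """
    write(head_and_body)
    sleep(pause_seconds)  # do nothing except waste time
    write(tail)


if __name__ == "__main__":
    send_page("<html><body>...", "</body></html>\n",
              write=lambda s: print(s, end="", flush=True),
              pause_seconds=2.0)
```

Under these assumptions, a crawler that fetches one page at a time
with a 5 second pause can pull at most 720 pages per hour over that
connection.
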
The effect of these calculated pauses is to ensure that any address
harvesting web crawlers that may be diligently attempting to suck as
many Wpoison-generated web pages out of your site as fast as possible
will in fact only be able to suck pages out at a moderate pace, one
which will not have any sustained dramatic effect upon your CPU usage
or network bandwidth. That pace is still fast enough that if one of
these spammer address harvesting web crawlers is left to try to digest
your entire web site, say, overnight, then within a few hours (and
certainly by morning) its database of e-mail addresses will have been
well and thoroughly polluted by millions of utterly bogus e-mail
addresses, just as we would like.
The bottom line is that sites can now safely install and run Wpoison
without any fear that doing so may cause sudden large drains of CPU
cycles or network bandwidth. It won't. Period. End of story.
[1]
It is important to understand the distinction between legitimate web
crawlers and the rather different ones that the spammers use.
Legitimate web crawlers, such as those used by the major search engine
companies, always obey the standardized and widely accepted
Robot Exclusion Protocol, and they take its use, on any given web
page, as a clear and unambiguous ``keep out'' sign. Spammers who are
trawling for e-mail addresses, on the other hand, have no incentive
whatsoever to skip any web pages that might contain valuable fresh
e-mail addresses, so the address harvesting web crawlers that they use
tend to totally ignore the established standards of good practice on
the net, ignoring all posted ``keep out'' signs and blundering
recklessly ahead even when they have been warned that there is no data
of any permanence or interest on the page or pages ahead. In fact it
is this reckless behavior that Wpoison relies upon.
By being stupid, brutish, and careless, spammers play right into
our hands!
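
To make the ``keep out'' check concrete, here is a minimal sketch of
how a well-behaved crawler might consult a site's robots.txt rules
before fetching a page, using Python's standard urllib.robotparser
module. The rules shown, the /cgi-bin/wpoison/ path, the PoliteBot
name and the example.com host are all hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for a site running a decoy script.
ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi-bin/wpoison/
"""


def allowed(robots_txt, user_agent, url):
    """Return True if the given rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in ("http://www.example.com/index.html",
                "http://www.example.com/cgi-bin/wpoison/abc"):
        print(url, allowed(ROBOTS_TXT, "PoliteBot", url))
```

A legitimate crawler skips every URL for which such a check fails. An
address harvester that never performs the check walks straight into
the decoy pages.
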
It should be noted, however, that since the development of the first
publicly released version of Wpoison, spammers have been starting to
catch on to the fact that their own stupidity and greediness in
reading all web pages, even when they have been warned off, was in
fact causing them more harm than good. Because of this, the author of
Wpoison now believes that many (and perhaps even a majority) of the
spammers' address harvesting web crawlers have been reprogrammed so
that they now do obey the standard Robot Exclusion Protocol. This
actually represents a sort of victory for those of us who do not want
to have our e-mail addresses harvested by the spammers, because now we
can gain a measure of protection from the spammer address harvesting
web crawlers simply by arranging to have all e-mail addresses that are
displayed on our web sites appear only on pages that we have marked as
un-scrapable (for robots) via the standard Robot Exclusion Protocol.
The author of Wpoison nowadays strongly advises (to all who will
listen) that all web pages containing real e-mail addresses should in
fact be marked as being ``off limits'' via the standard
Robot Exclusion Protocol, both now and into the foreseeable future.
Doing that now provides a measure of protection from having your
address harvested by spammers all by itself (even without having
Wpoison anywhere in the picture).