Here we go. It's finally done. Hope you enjoy this. Note that this isn't meant to make you a rich spammer king with your own harem of whores and private drug vault. But it just might get you started! Maybe.
---------------------------------------------------------------------------------------------------------------------------
ANATOMY AND CONSTRUCTION OF SPAMBOTS
A CONCISE PAPER ON SPAMBOT TECHNOLOGY, COUNTERMEASURES AND THEIR RESPECTIVE SHORTCOMINGS
vezzy of evilzone.org
INTRODUCTION AND DEFINITIONS
Before we begin to discuss methods of operation, classifications and technical workings of spambots, let us first define what a spambot actually is. Wikipedia defines a spambot as such: “A spambot is an automated computer program designed to assist in the sending of spam. Spambots usually create fake accounts and send spam using them.”
This is obviously a very elementary, surface-level definition of a spambot, but it is essentially true. In more detail, a spambot is an automated script, usually a subclass of the common web crawler/scraper, that is used to facilitate the sending of spam through a plethora of means, which vary depending on the type of applications it specializes in probing, harvesting information from and sending spam to. Due to their nature, spambots ignore web meta conventions such as robots.txt and nofollow directives.
Spambots can range from basic crawler scripts specialized for navigating through and posting spam to a particular website, to vast and highly advanced black hat search engine optimization suites employed by amateur and professional spammers and spamdexers alike for sending of spam messages on a massive scale across a multitude of platforms and protocols. An example of the latter is the popular proprietary program known as Xrumer, which has been in continuous development since its inception in 2006 and is a standard favorite among spammers to this day.
The reasons for writing and using spambots differ among people, but the practice is firmly associated with black hat search engine optimization (SEO), also known as spamdexing, which is the use of immoral, obtrusive and sometimes illegal techniques to generate ad revenue and higher relevance on search engines for the personal benefit of the spammer. These techniques include link farming, duplicate scraper sites, meta tag stuffing, hidden links, spam blogs, excessive link building and mass spam messaging. Many of these techniques can be automated with advanced spamware, but of particular note and prevalence are the latter two. Indeed, “link building” is often used by spamdexers as a sly euphemism for their true actions, even though the actual practice is legitimate. Spambots are designed to automatically send out large quantities of spam, but can also be programmed to place links in strategic places (forum signatures, blog comments, etc.), which is a form of black hat link building.
Spambots are not limited to email and web applications. They are also encountered in online games, where they are used for griefing or to put strain on a server as part of a denial-of-service attack, or in IRC networks to crapflood channels, although the latter two are tightly related to the more specific and openly malicious programs that are denial-of-service (DoS) bots.
Spambots should be differentiated from zombie computers hooked to a botnet, since spambots are stand-alone programs, and malicious botnets of this type also place heavy emphasis on distributed denial-of-service (DDoS) attacks, malware and the recruitment of more zombies, which is distinct from the predominantly black hat SEO bent of spambots. Nonetheless, zombies are frequently used to send email spam in particular.
Depending on the complexity of the spambot, the method to generate messages may simply involve hardcoded text, cycling through several values of hardcoded text (from an array, some other data container or a text file), or in more advanced cases, through the use of Markov chains. Some spambots may even scrape text from web pages for later use.
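To illustrate the Markov chain approach, here is a minimal order-1 sketch; the corpus and the output length are illustrative assumptions:

import random

corpus = "buy cheap pills online buy cheap watches online today".split()

# Build an order-1 Markov chain: map each word to the list of words that follow it.
chain = {}
for current_word, next_word in zip(corpus, corpus[1:]):
    chain.setdefault(current_word, []).append(next_word)

# Walk the chain to generate a pseudo-natural message.
word = random.choice(corpus)
output = [word]
for _ in range(10):
    word = random.choice(chain.get(word, corpus))  # fall back to the corpus at dead ends
    output.append(word)

print " ".join(output)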
The creation of spam and anti-spam techniques has proven to be a perpetual arms race. New anti-spam solutions keep appearing and older ones are in the constant process of refining themselves, whereas on the other side experienced and dedicated spammers are always analyzing the inner workings and mechanics of these techniques, and implementing methods to recognize and bypass them. Therefore, this paper is by no means a complete overview, but it still serves to fill a void in research of spambots and provide useful information to an interested reader.
The code examples given here will be in Python 2.7, using the standard library and the third-party Splinter web browser abstraction layer. Not all examples are necessarily suited as stand-alone programs, and may only be demonstrative of the underlying concepts.
TAXONOMY
Email spambots
The most basic and prevalent class of spambots. They typically work by scraping websites for email addresses (through regular expressions or other means), compiling these addresses into a list and then sending hardcoded or user-specified spam messages to the list of recipients via built-in SMTP client session functionality, which is included in the standard library of most high-level programming languages. They may be rudimentary scripts that use a hardcoded email address specified by the spammer, or contain more advanced features such as creating addresses on the fly, spoofing email and MIME headers, sending attachments and even trying to crack email accounts and hijack them for use in spamming activity, or retrieving already cracked accounts from websites using advanced search operator strings, commonly known as dorks.
Forum spambots
By far the most sophisticated spambot class, and recently also the most popular. These spambots have evolved quite significantly and quickly from their early incarnations. What started off as simple tools to interact with newsgroups or basic message boards with no effective anti-spam measures has evolved into powerful software that can navigate across and register to many types of forum software, make use of private messaging and user control panel settings, and recognize and get past most forms of anti-spam: hidden trap forms through HTML/CSS parsing, and CAPTCHA through optical character recognition (OCR) for images, and pattern analysis or parsing of natural language (basic computational linguistics) for text.
Advanced forum spambots may also include utilities such as search engine parsers to retrieve large amounts of links by user-specified keywords or instructions, compiling them into lists or databases and then proceeding to try and attack them all in a multi-threaded fashion. An example of this is Hrefer, a complementary tool to Xrumer.
These spambots either use hardcoded email addresses in their primitive forms, or they can be coded to create email addresses on the fly and navigate through webmail interfaces to validate forum registrations. They can also be trained to post in a fashion that evades flood control, and more complex ones have predefined behaviors for different forum software, which can either be set manually or programmed so that the bot can forensically examine the software type (e.g. looking for the word “vBulletin” in the index page, likely in the footer) and switch to subroutines optimized for it, as in the sketch below.
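A minimal sketch of such forensic fingerprinting; the URL and product strings are placeholder assumptions:

import urllib2

# Placeholder target; a real bot would iterate over a harvested list of URLs.
page = urllib2.urlopen("http://forum.example.com/index.php").read()

if "Powered by vBulletin" in page:
    print "Looks like vBulletin; switching to vBulletin subroutines."
elif "PunBB" in page:
    print "Looks like PunBB; switching to PunBB subroutines."
else:
    print "Unknown forum software; falling back to generic form analysis."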
It is worth noting that the common spambot is unable to parse JavaScript, and as such plenty of forum spam defenses involve hiding forms with it. More sophisticated bots, such as those built on top of browser abstraction layers, can. Most common bots don't need to do this, though, as they work by crafting HTTP POST requests, either by analyzing form names or by consulting precomputed data structures of form names for different forum software, rather than filling out the forms themselves. There have also been some bots that exhibit behaviors such as account lockout attacks (sometimes inadvertently) and POST form cracking to retrieve user credentials rather than creating their own, but these are rare and often inefficient in the latter case.
Forum spambots are divided into two subclasses: playback bots, which send strings of replayed POST parameters to the form submission URL, and form-filling bots, which actually read, recognize and submit filled-out HTML forms.
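For illustration, a minimal playback-style sketch using only the standard library; the target URL is a placeholder, and the field names are borrowed from the PunBB configuration shown later in this paper:

import urllib, urllib2

# Precomputed form field names for the target software; no form is ever parsed.
params = urllib.urlencode({
    "req_username": "AUTOAUTOAUTO123",
    "req_password1": "swordfish",
    "req_password2": "swordfish",
    "req_email1": "nomoneydownbackpayments@mailinator.com",
})

# Replay the parameters straight at the submission URL.
response = urllib2.urlopen("http://forum.example.com/register.php", params)
print response.getcode()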
Blog and guestbook spambots
Spambots that target blogs and guestbooks tend to be hybrids of the first two classes of spambots, and are often form-filling. Functionality to target blogs and guestbooks is bundled in addition to message boards for more advanced forum spambots, or bots can be specifically written to target these platforms.
Blog and guestbook spambots can have an amalgamation of features, but in general they are much simpler to write than forum spambots due to how commenting systems for such sites are implemented in most applications: plain forms with no anti-spam methods, most often just a screen name, email address, optional website and the comment itself.
Predeveloped comment and discussion widgets such as Disqus only require a comment and sign-in from a Disqus or social networking profile, which makes it theoretically easy to send spam. However, due to Disqus being a JavaScript application, the effectiveness of the spam and link building is limited, and specific code needs to be written for interfacing with it. Nevertheless, for most blogs and guestbooks, spam is abundant and trivial to automate, with spammers always being highly interested in targeting sites of that nature.
Social network spambots
These bots are specifically coded to interface within the boundaries of social networking websites. The most common example is that of Twitterbots, which, as the name implies, operate within the Twitter microblogging platform.
Generally most large social networks provide APIs or have third-party APIs written for interfacing with the application, making these bots relatively trivial to write. Twitterbots, for instance, are often written as a recreational activity and not for deliberate spamming. They can arbitrarily follow users, reply to tweets containing certain keywords and post tweets of their own.
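As a hedged illustration, a Twitterbot sketch using the third-party tweepy wrapper (assumed here as one such API, following its documented usage at the time of writing); the credentials and keyword are placeholders:

import tweepy

# Placeholder credentials obtained from a registered Twitter application.
auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_SECRET")
api = tweepy.API(auth)

# Reply to tweets containing a certain keyword, as described above.
for tweet in api.search(q="some keyword"):
    api.update_status("@%s check this out" % tweet.user.screen_name,
                      in_reply_to_status_id=tweet.id)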
Social network spambots are perhaps the least threatening class of spambots, although they're still a nuisance and can potentially be used to generate large interest in the spammer's links if implemented well enough. Depending on the moderation effort of the given social network, they can be removed fairly quickly or remain for prolonged periods of time in the user base, churning out messages.
IRC spambots/automated flooders
These are essentially regular IRC bots, but regimented for the sole and minimalistic purpose of flooding channels with spam, usually regular crapflooding (repeatedly sending large and vapid messages).
Most IRC flooders aren't typically designed to aid in black hat SEO, but rather just to launch denial-of-service and bandwidth exhaustion attacks for amusement or out of spite. The common methods of CTCP PING, DCC request, nick and connect floods have no real use for black hat SEO and spamdexing.
Nevertheless, IRC bots are used to send advertising spam as well. In addition, they are trivial to write in any language that supports Berkeley sockets out of the box, plus there is the added advantage of easily being able to connect using multiple program-controlled clients (clones) to carry out the spam, as well as tunneling through open proxies (SOCKS, HTTP, etc.) for extra effectiveness.
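A minimal sketch of one such program-controlled client session over a raw socket; the network, channel and nickname are placeholder assumptions, and a real bot would also answer server PINGs and loop its PRIVMSG over multiple clone sockets:

import socket

irc = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
irc.connect(("irc.example.net", 6667))
irc.send("NICK clone001\r\n")
irc.send("USER clone001 0 * :clone001\r\n")
irc.send("JOIN #target\r\n")
irc.send("PRIVMSG #target :hello\r\n")
irc.close()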
IRC spambots and flooders are still perhaps the easiest class of spambots to mitigate and control.
Game spambots
These bots are most often trivial applications that are used to send spam via the instant/team messaging features of cooperative and/or multiplayer online games. Such bots are less frequently used for spamdexing and advertising, but mostly for malicious purposes to put strain on the server or for personal amusement to grief. Therefore, they tend to overlap with the more pronounced category of DoS bots.
Once again, they are easy to write in any language that has support for Berkeley sockets, but they may also make use of operating system and hooking APIs to be able to trigger events such as keystrokes and mouse clicks or ones related to the game itself for native interaction with its features.
Note that this paper will primarily focus on the first two classes of spambots: email and forum bots.
OPERATION
Now that we have cleared the introduction and basic taxonomy, we shall examine the technical workings of two rudimentary spambots: an email harvester/mass mailer and a form-filling forum spambot.
1). Email spambot overview
Below is the source code of a basic proof-of-concept email spambot, made using only the Python standard library. It harvests email addresses into an array by looking for matches of a regular expression in the page source (the page is specified via a user argument), archives these addresses into a text file and then connects to the Gmail SMTP server to send a message, relying on hardcoded login credentials and a user-specified subject and message as arguments.
The design was partially influenced by the spambot described in an academic resource on the subject.
Note that regular expressions need not be able to unearth every RFC-compliant email address. Not only would it be unnecessary, since the vast majority of users stick to reasonable boundaries when naming the local part of their address, but matching every single compliant result would require a regex of monstrous magnitude. The regex engine will likely go insane trying to parse that (although it seems Perl handles it well, which is no surprise there), resulting in a phenomenon known as catastrophic backtracking. The IETF are laughing their asses off as we speak.
# PoC email spambot for whitepaper demonstration purposes.
# vezzy of evilzone.org
# Permission is hereby granted to use this as you wish.

import urllib2, re, sys, smtplib, email.utils
from email.MIMEMultipart import MIMEMultipart
from email.MIMEText import MIMEText

# Global configuration
account = ''
password = ''
url = sys.argv[1]
subject = sys.argv[2]
message = sys.argv[3]
output_filename = url.split("/")[2]
emails_harvested = []

class EmailHarvester:
    def get_page_source(self, url):
        return urllib2.urlopen(url)

    def harvest_count(self):
        print "Number of addresses harvested:", len(emails_harvested)

    def harvest_addresses(self):
        try:
            content = self.get_page_source(url)
        except Exception:
            print "Error fetching page: see traceback dump."
            raise
        # Compile the pattern once, then scan the page line by line.
        email_regex = re.compile(r"\w+@[a-zA-Z_]+?\.[a-zA-Z]{2,6}")
        for line in content:
            for address in email_regex.findall(line):
                # parseaddr() gives a rudimentary sanity check on each match
                if email.utils.parseaddr(address)[1] and address not in emails_harvested:
                    emails_harvested.append(address)
        self.harvest_count()

    def output_addresses_to_file(self):
        addr = open(output_filename, "w")
        for email_address in emails_harvested:
            addr.write(email_address + "\n")
        addr.flush()
        addr.close()

class SpamMailer:
    def setup_mailer(self, to, subject, text):
        message = MIMEMultipart()
        message['From'] = account
        message['To'] = to
        message['Subject'] = subject
        message.attach(MIMEText(text))
        MX = smtplib.SMTP("smtp.gmail.com", 587)
        MX.set_debuglevel(1)  # print SMTP server responses as the message is sent
        MX.ehlo_or_helo_if_needed()
        MX.starttls()
        MX.login(account, password)
        try:
            MX.sendmail(account, to, message.as_string())
        except smtplib.SMTPException:
            print "Error sending message to %s." % (to)
            sys.exit(1)
        finally:
            MX.close()

    def send_to_list(self):
        print "Preparing to send message(s).\n"
        readfile = open(output_filename, "r")
        for address in readfile:
            self.setup_mailer(address.strip(), subject, message)
        readfile.close()
        print "\nMessage(s) sent successfully."

if __name__ == "__main__":
    crawler = EmailHarvester()
    crawler.harvest_addresses()
    crawler.output_addresses_to_file()
    mailer = SpamMailer()
    mailer.send_to_list()
2). Forum spambot overview
Below is the source code of a basic proof-of-concept form-filling forum spambot using a browser automation and web testing framework called Splinter. Splinter is an abstraction layer over the well-known Selenium, with the aim of providing a crystal-clear, simple and powerful API for developers to conduct web tests in virtually no time. In fact, it is so abstract it could almost be considered pseudocode.
Yet for our purposes it is sufficient, as we want to demonstrate how a common spambot could operate. This particular script is specifically optimized for a vanilla PunBB installation, assuming no additional defense features.
It makes use of an external configuration file that maps HTML input form names so that they can be easily reused. The configuration file is merely an associative array (dictionary) on the basis of key-value pairs, without using special features such as Python's native ConfigParser.
Splinter makes use of web drivers. I have used the Firefox driver, so every time the script is run, a new instance of Firefox will be opened and the written instructions will play out. While this is impractical for a real-life spambot, I have chosen this approach so as to map the bot's actions visually in real time. In addition, most bots probably won't follow such elaborate browsing patterns, but the benefit of patterns like these is that they emulate human user behavior, something smart bots will do.
It should be noted that Splinter and other testing frameworks allow for the usage of headless drivers as well, which means activity can go on without opening an explicit browser instance. This goes to show how benign technology like this, meant to ease the lives of people conducting software tests, can also make spammers' jobs much easier.
Splinter also supports JavaScript execution and interaction with iframes and alert boxes, making it potentially quite useful for spambot construction.
The bot itself works by having some hardcoded global variables (username, password, email, topic subject) and takes the message as an argument. It then navigates through the PunBB forum, clicks on the registration link, fills out the forms with help from the key-value pairs mapping the input names so as to recognize them, waits 60 seconds to imitate a human user taking their time, submits and then waits for the spammer to validate the account (or simply continue) by pausing the process with a raw_input() call. It then logs in with its credentials and finally directly visits the topic submission page for the forum with an ID of 1, posts the user-specified message that was taken as an argument, submits, destroys all cookies and terminates the process.
Without further ado, the bot:
# PoC PunBB forum spambot for whitepaper demonstration purposes.
# vezzy of evilzone.org
# Permission is hereby granted to use this as you wish.

from splinter import Browser
from splinter.request_handler.status_code import HttpResponseError
from config import FormMap
from random import randint
import time, sys

# Initialize web driver
browser = Browser("firefox")

# Global configuration
USERNAME = ["DEADBEEEEEF", "AQW23UUU", "AUTOAUTOAUTO123", "Asdfguy972222"]
PASSWORD = "swordfish"
EMAIL = "nomoneydownbackpayments@mailinator.com"
SUBJECT = "asdfghjkl"
MESSAGE_BODY = sys.argv[1]

# Pick a username at random; the same one must be used for registration and login
username = USERNAME[randint(0, len(USERNAME) - 1)]

class ForumSpambot:
    def get_title(self):
        print "We have entered", browser.title

    def register(self):
        browser.fill(FormMap["email"], EMAIL)
        browser.fill(FormMap["username"], username)
        browser.fill(FormMap["password"], PASSWORD)
        browser.fill(FormMap["confirm_password"], PASSWORD)
        time.sleep(60)  # deliberately take time in order to emulate human user behavior
        browser.find_by_name("register").click()
        print "Registration submitted."

    def login(self):
        browser.visit("http://www.bmwfansblog.com/forum/index.php")
        raw_input("Press Enter to continue.")
        if browser.is_text_present("Login"):
            print "There appears to be a login page."
            browser.click_link_by_text("Login")
            browser.fill(FormMap["username"], username)
            browser.fill(FormMap["login_password"], PASSWORD)
            browser.find_by_name("login").click()
            print "Logged in successfully."

    def post(self):
        try:
            browser.visit("http://www.bmwfansblog.com/forum/post.php?fid=1")
        except HttpResponseError, e:
            print "Connection failure. Status code %s, reason %s." % (e.status_code, e.reason)
            sys.exit(1)
        browser.fill(FormMap["topic"], SUBJECT)
        browser.fill(FormMap["message"], MESSAGE_BODY)
        browser.find_by_name("submit").click()
        print "Topic posted successfully. Terminating session."
        browser.cookies.delete()
        sys.exit("Dead.")

    def navigate(self):
        try:
            browser.visit("http://www.bmwfansblog.com/forum/index.php")
        except HttpResponseError, e:
            print "Connection failure. Status code %s, reason %s." % (e.status_code, e.reason)
            sys.exit(1)
        self.get_title()
        if browser.is_text_present("Register"):
            print "There appears to be a registration page."
            browser.click_link_by_text("Register")

if __name__ == "__main__":
    try:
        bot = ForumSpambot()
        bot.navigate()
        bot.register()
        bot.login()
        bot.post()
    except KeyboardInterrupt:
        print "\nEnding session."
The configuration file:
# Key-value pair [dictionary] configuration file to map input names for PunBB boards.
# For use in PoC PunBB forum spambot.
# vezzy of evilzone.org

FormMap = {
    "username": "req_username",
    "login_password": "req_password",
    "password": "req_password1",
    "confirm_password": "req_password2",
    "email": "req_email1",
    "topic": "req_subject",
    "message": "req_message"
}
COUNTERMEASURES
(AND THEIR OWN RESPECTIVE COUNTERMEASURES)
1. Address munging and URL obfuscation
An often-employed countermeasure is one that has gone on to be known as “address munging”. This involves obfuscating an email address in such a way that a user will clearly interpret how it is actually meant to be spelled out, but that will drive off bots whose harvesting engines and rulesets will be unable to parse it and/or will avoid it completely.
There are countless ways to munge an email address. Consequently, sophisticated bots need large sets of coded patterns to be able to detect most.
Here is the hypothetical address “sendmespam@example.com” munged using several common techniques:
sendmespam (at) example.com
sendmespam (at) example (dot) com
s e n d m e s p a m @ e x a m p l e . c o m
sendmespam@example.com.removethis
sendzespaz@exazple.coz (replace 'z' with 'm')
sendmespam&#64;example&#46;com (HTML character entity references for '@' and '.')
...etc.
Other munging techniques involve placing the email address as an image instead of plaintext, using client-side scripting (JavaScript) to build an address each time the user visits the page, among others.
The main disadvantage of address munging lies in UX (user experience), namely for those who use screen readers and text-based web browsers. Some of the more “creative” munging that starts to cross into the realm of cryptography may even ultimately dissuade regular users from sending an email to the given address.
URL obfuscation is the even more paranoid measure of obscuring regular links so as to avoid them being followed or harvested by crawlers. Generally, this also tends to lure away legitimate agents, such as search engine crawlers.
The most common method is the iconic URI scheme hxxp:// with its variations _ttp:// and h**p://. Others include writing the domain in octal, hexadecimal or abusing HTTP basic authentication.
For example, the URL http://www.google.com@evilzone.org will lead you to Evilzone. We can obscure any part of the URL with the aforementioned techniques to make it virtually unreadable to the average user. This was commonly used by phishers back in the day, but most modern browsers now detect it and give out a warning.
In the end, these two techniques are not considered to be of real use in fighting spam, as advanced spambot solutions have been trained to spot most common patterns (with the possible exception of client-side scripting) and fix them through text processing procedures, whereas end users will likely be annoyed and abandon their intentions.
Here are a few code snippets that reverse some basic URL obfuscation and address munging techniques, the general logic of which could be employed by spambots:
#!/usr/bin/python

url = "_ttp://evilzone.org"

# Map common obfuscated scheme prefixes back to a canonical http://
if url.startswith("h**p://"):
    url = url.replace("h**p://", "http://")
elif url.startswith("hxxp://"):
    url = url.replace("hxxp://", "http://")
elif url.startswith("_ttp://"):
    url = url.replace("_ttp://", "http://")

print url
#!/usr/bin/python

address = "s e n d m e s p a m @ e x a m p l e . c o m"
address2 = "sendmespam (at) example (dot) com"
address3 = "moc.elpmaxe@mapsemdnes"

# Collapse spaced-out addresses
if ' ' in address:
    address_unspaced = address.replace(' ', '')
    print address_unspaced

# Undo the (at)/(dot) substitutions
if " (at) " in address2 and " (dot) " in address2:
    address2 = address2.replace(" (at) ", "@")
    address2 = address2.replace(" (dot) ", ".")
    print address2

# Un-reverse a mirrored address
if address3:
    address3_reversed = address3[::-1]  # extended slice reverses the string
    print address3_reversed
2. Spam filtering
2.1. Content-control and keyword filtering
This is the most primitive form of spam filtering, which involves analyzing incoming messages for commonly-seen spam keywords and assigning a score to each word. If the score equals or exceeds a certain limit, the message is marked as spam. Even more rudimentary approaches simply look for keyword matches and immediately mark the message.
Generally such filters maintain predefined databases, configurations and/or tables of keywords and spam messages that they use to perform a lookup on the message and find results. The most obvious downside to this is that seeing as there is no way to train the filter beyond its primitive programmed rules, it has a very high potential for false positives, which severely dampen the user experience.
Such filters require consistently updated spam databases if they are to remain relevant, as due to the blacklist nature of content-control, they can quickly grow outdated as new information and spam techniques become formulated and used. On the other hand, as mentioned before they can also start to overzealously block even innocuous content, as content-control typically has no computational linguistics mechanism with which to parse and uncover some sort of context. Thus medical content can also be blocked alongside spam if the filter detects keywords, either as separate words or substrings. This is known as the Scunthorpe problem.
The characteristics which content-control filters look for include the message body, email headers, attachments and so on, but the most rudimentary ones analyze only the actual message.
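As a toy illustration of the scoring approach described above; the keywords, weights and threshold are purely illustrative assumptions:

SPAM_KEYWORDS = {"viagra": 5, "casino": 4, "winner": 3, "free": 1}
THRESHOLD = 5

def is_spam(message):
    # Sum the weight of every known keyword occurring in the message.
    score = 0
    for word in message.lower().split():
        score += SPAM_KEYWORDS.get(word, 0)
    return score >= THRESHOLD

print is_spam("You are a winner of free casino credit")  # True
print is_spam("Meeting rescheduled to Thursday")  # False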
Getting past content-control filters can be a difficult task depending on how the filter actually works, but writing spam messages in a more intelligent manner (as opposed to blatantly obvious 419 scams or advertising) and not using HTML or attachments can help. Nonetheless, they aren't a very high priority for most spammers given their rarity, having been displaced by more advanced Bayesian filters. Rather, content-control filters are more commonly used in the domain of parental control and censorware systems, employed by ISPs, public and private institutions alike, as well as, of course, governments – where more false positives are always a good thing.
2.2. Email provider and domain filtering
A common and easy-to-implement defense used by webmasters and forum administrators is to blacklist certain email providers that are used by spammers, such as small underground services or disposable email generators. Others go as far as blocking entire domain ranges (*.ru, *.cz and so on).
While this rudimentary defense works against primitive bots that have a single hardcoded address or use a few well-enumerated providers, it is obviously futile against more advanced ones, and it is detrimental from a UX perspective. A user may have an email address registered to a blacklisted provider or domain range, and upon seeing the error, will likely give up on registration (or sending a contact inquiry) completely.
2.3. Bayesian filtering and Markovian discrimination
Bayesian filtering is in many regards similar to content-control filtering, but with a fundamental twist.
To explain how Bayesian spam filtering works, first we must have a basic understanding of Bayesian logic. Simply put, in Bayesian logic nothing is objectively true or false. Instead a relative probability is assumed, which is constantly updated in light of new data. This is in contrast to Boolean logic.
This way, Bayesian filters are capable of elementary machine learning. Whereas a primitive content-control filter will always mark certain messages as spam (e.g. ones that contain a suspicious keyword, such as “Viagra”), a Bayesian filter will adjust its rules over time by taking note of both its own decisions and the user's. Another advantage of this is that each user will have their own individually tuned filter, requiring more effort on the spammer's part to ensure that they can get past it.
In contrast to content-control systems, ideal Bayesian filters build their heuristic lists of spam as they go, instead of relying on strictly predetermined rulesets and databases. Otherwise, the characteristics that Bayesian filters use are similar to those of content-control filters, including the message, email headers, meta-information (placement of certain phrases), attachments and so on, though in general they tend to employ more characteristics than content-control filters do.
A well-designed Bayesian filter will thus maintain a very low rate of false positives. On the other hand, such filters are prone to false negatives, where spam mail is let past as legitimate, if they can be confused.
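To make the logic concrete, here is a toy sketch of per-word spam probabilities combined naively; the counts, smoothing and threshold are illustrative assumptions, not a production filter:

# Occurrence counts learned from user-classified mail.
spam_count = {"viagra": 40, "meeting": 1}
ham_count = {"viagra": 1, "meeting": 30}

def word_spamminess(word):
    # Unseen words get a small neutral count so they neither help nor hurt much.
    s = spam_count.get(word, 0.5)
    h = ham_count.get(word, 0.5)
    return float(s) / (s + h)

def classify(message, threshold=0.9):
    p_spam, p_ham = 1.0, 1.0
    for word in message.lower().split():
        p = word_spamminess(word)
        p_spam *= p
        p_ham *= (1.0 - p)
    # Normalized naive Bayes combination of the per-word probabilities.
    return p_spam / (p_spam + p_ham) >= threshold

print classify("cheap viagra here")  # True
print classify("meeting notes attached")  # False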
The technique used by spammers to get around and confuse Bayesian spam filtering is known as Bayesian poisoning. Bayesian poisoning is basically a fancy term for “insert random innocuous words between the spam message to deceive the filter”. Yes, it is just that.
Yet its effectiveness depends on how well designed the filter is in the first place, how well its heuristics are tuned as time goes on, what words are chosen and how strategically they are placed. Most research has shown that the attack vector is largely ineffective, but against more naïve Bayesian filters, and with some fine-tuning of the word placement, there can be a modest result.
A successful Bayesian poisoning attack will theoretically inflict double damage. Not only will it let past a false negative, but the chances of having higher false positives increase as the user marks the poisoned message manually.
However, Bayesian poisoning is even less effective against filters that employ Markovian discrimination instead. The key difference is that while Markovian filters still rely on general Bayesian logic, rather than judging words, they analyze and parse entire phrases, using the Markov model to logically predict what word will follow a given word. These filters are relatively uncommon, though. The most well-known example is CRM114.
Most modern spam filters are Bayesian ones, including SpamAssassin, SpamBayes, POPFile, Bogofilter and others. There are also cloud-based Bayesian filters for blogs such as Akismet and Mollom, along with their own APIs.
2.4. Challenge-response filtering
Challenge-response (C/R) filtering is a well-known, moderately used method to thwart spam by having two distinct phases of communication. The first is actually sending the message, and the second is responding to a “challenge” that is automatically sent back by the server, in order to provide a means of authentication for the sender and add them to the whitelist.
The challenges can vary from the simplest (responding to an automatically sent message) to having to answer a question that requires deep language comprehension. Email confirmation following a registration is also technically an example of C/R.
C/R systems are criticized for being unnecessarily burdensome to the UX. Another, more severe issue arises from spammers forging legitimate sender address headers. Doing this effectively shifts the responsibility of having to solve the challenge onto the forged victim, rather than the spammer, while also causing unnecessary backscatter.
Additionally, if someone operates a C/R system (especially a more poorly implemented one) on their mail server and then subscribes to a mailing list without whitelisting the respective addresses first, everyone else participating in the list will be prompted with a challenge.
Spammers can also make their own notes of who deploys C/R systems and potentially write a simple procedure that automatically replies to basic challenges, which they can then call if needed. However, some of the more dedicated ones go even further: they exploit cheap human labor to answer challenges for them. Assuming the spammer makes a good profit margin from their activities, it isn't a debilitating issue for them.
Even simpler, if there is no means of email authentication, they can figure out which addresses are whitelisted and spoof the source headers to match one of them as a sender address.
2.5. Rule-based filtering and Sieve scripting
Rule-based filtering is the general approach of applying global server-side rules that prevent messages matching some sort of pattern or phrase in the body or a header from being passed as legitimate mail, marking them as spam instead if a rule is matched. Rules can be made out of words, regular expressions or even dedicated scripting language instructions, such as with Sieve.
Sieve is a server-side domain-specific scripting language with a terse syntax designed specifically for writing rules to filter incoming messages. Most major mail servers, such as Dovecot, Postfix and Cyrus (where it was originally developed), have a Sieve interpreter.
The base implementation of Sieve has no variables or loops. Instead there are a few basic control flow statements and a list of built-in commands and functions that act as matching directives.
Here are a few example Sieve scripts:
require "fileinto";

if address :is ["from", "sender"] "porky.pig@example.com"
{
    fileinto "friends";
}

require "fileinto";

if header :contains "X-Spam-Flag" "YES"
{
    fileinto "Spam";
}

require "fileinto";

if not exists ["From", "Date"]
{
    fileinto "Spam";
}
The first script will automatically file all emails from a given address into a friends folder, the second one will automatically put messages marked by SpamAssassin with its spam flag mail header into the spam folder, and the third one will put all messages lacking an origin and date header into the spam folder.
The issues with rule-based filtering are several. First of all, the Scunthorpe problem arises again, as there is no way to differentiate legitimate mail from spam containing the same content, seeing as the rules will merely do exactly as they are told. Second, since rules filter exact content and do not look for relative or similar mutations of a word (unless specifically instructed to, of course), they will let through email that, for example, contains the word “casin0” rather than “casino”. Thirdly, they require constant maintenance, which may be time-consuming, or even damaging if an unintentionally erroneous rule is inserted and goes undetected by the administrator for a period of time.
Some spammers have also taken to using HTML and MIME to bypass rule filtering while still sending out a readable message to their recipients.
3. CAPTCHA
Everyone is familiar with the ubiquitous CAPTCHA – perhaps not by name, but they have most certainly completed many of these to determine they are human. CAPTCHA as an acronym stands for Completely Automated Public Turing test to tell Computers and Humans Apart.
There is no definitive standard or implementation of CAPTCHA. Rather, it is a broad concept that applies to many different anti-spam measures with the aim of validating human users and stopping artificial agents. Many CAPTCHA standards are proprietary in nature, which means that by definition they also rely on the undesirable security through obscurity, rather than heeding Kerckhoffs's principle for secure cryptosystems and general security applications.
The most common form of CAPTCHA is a textual CAPTCHA, which involves having the user type the symbols featured in an image (whether dynamically generated by an algorithm or statically taken from a list of predetermined values). These symbols are most often distorted, but not always so, especially historically.
The effectiveness of CAPTCHAs varies wildly, as many are poorly implemented or not well thought out in what filters they apply to distort the image, making them susceptible to the most common attack used to fight this method of protection: optical character recognition (OCR).
Optical character recognition presents a classic problem and subset of the broad and complex fields of computer vision and artificial intelligence, among others. It deals with analyzing and converting scanned and/or electronic representations of text into text that is encoded and can be naturally parsed by machines. The process of OCR involves many stages and lots of graphics processing, but we'll be covering it here in the context of CAPTCHA subversion.
CAPTCHAs have long been controversial for being detrimental to UX, especially for visually impaired users. Their opponents propose various other anti-spam measures, and some are very vocal in their opposition to CAPTCHA. This criticism is true to some extent, with some implementations distorting images to such a high level that they become cryptic and virtually unreadable even for well-sighted users. This has also been the subject of much humor; however, the algorithms have gotten more reasonable in recent times, and many implementations also offer the choice of an audio CAPTCHA.
3.1. Regular (Textual) CAPTCHA
By “regular CAPTCHA”, we are referring to the classic textual CAPTCHA implementations that were particularly popular before the massive rise of contemporary social networking. These CAPTCHAs are all alike in the sense that they are alphanumeric strings of characters, with varying amounts of distortion, filters, skewing, obfuscation and so on.
They also vary in how they are generated: dynamically through a server-side algorithm, or statically through a list of values. Most CAPTCHAs load a new value every time the page is refreshed, however some remain the same for the entire session. The default CAPTCHA for SMF is an example of the latter. Obviously this makes a spammer's job much easier, as they can directly pull the value from the server, dump it offline, crack it and insert the OCR-discovered string back into the form. This technique is usually superfluous, but can be used to make tables of pre-cracked CAPTCHAs and their hash-value maps for use in integrated spambot solutions.
Some schemes expose the generation routines to the client-side, which obviously enables users to study its workings and inject arbitrary values into an HTTP session to bypass it. If CAPTCHAs are validated from a finite list of hashes, then they can be bruteforced.
Generally, in order to crack CAPTCHAs, one needs an OCR engine and a graphics/image processing library with which to apply filters that minimize the distortion and get rid of perturbation, so the engine will have a higher chance of interpreting a correct value. The most widely used OCR engine is Tesseract, which was first developed by HP Labs in 1985 and later open-sourced and turned over to Google. There exist Python bindings for it as well: python-tesseract and PyTesser (which hasn't been updated since its initial release). Graphics processing libraries include Leptonica and PIL.
Many methods and tools have been written to crack basic CAPTCHA schemes: among the first and most well-known was PWNtcha by Sam Hocevar (whom you may also know as the author of the WTFPL), followed by cintruder, TesserCap and so on. Many scripts for individual schemes exist.
OCR cracking optimization for CAPTCHAs depends on the individual scheme, but most of them follow the same basic process. This involves removal of background noise, segmentation of text into separate characters (harder if they overlap) and identifying these characters. There's plenty of ways to implement the specifics. One could extract the text from the clutter by discriminating pixel groups. Character recognition could even involve reading from a graphical font set or through hand-written feature vectors that are fed into a neural network.
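As a hedged sketch of the preprocessing stage, the following uses PIL to grayscale and threshold a CAPTCHA image before handing it to PyTesser; the image_to_string() call follows PyTesser's documented usage, and the cutoff value of 128 is an assumption to be tuned per scheme:

import Image                      # PIL
from pytesser import image_to_string

img = Image.open("captcha.png")
img = img.convert("L")            # collapse to 8-bit grayscale
# Hard threshold: pixels lighter than the cutoff become white, the rest black,
# stripping most background noise before recognition.
img = img.point(lambda px: 255 if px > 128 else 0)

print image_to_string(img)        # hand the cleaned image to Tesseract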
Nonetheless, to make this more difficult, CAPTCHAs could implement character overlapping, case sensitivity, non-alphanumeric/special symbols, background and text gradients, differing character alignment, varying brightness and so on. On the downside, this will hamper UX and readability. It is best to assume that your CAPTCHA will eventually be cracked and settle for weeding out most spammers, or simply use a different line of defense.
Image-recognition CAPTCHAs are inherently more secure due to the difficulty of programmatically recognizing the contents of the foreground in pictures; however, a computational attack more sophisticated than brute force has also been discovered for them. They are also rarely used (Vidoop was one implementation, but it has been defunct since 2009).
As a footnote, social engineering can also be used to solve CAPTCHAs. Spammers can set up free online services that require a CAPTCHA to be solved in order for the service to be unlocked. These are then stored into a hash-value map for later use. A more elaborate version of this, which is a variation of the clickjacking attack, has been dubbed “CaptchaJacking”.
3.2. textCAPTCHA
textCAPTCHA generally refers to a common mainstream implementation of an otherwise old anti-spam technique. This does not rely on identifying characters from images, but instead on answering logical questions, ones that are meant to be plain obvious to a human user, but that would require contextual parsing by a bot.
Examples from the textCAPTCHA website include:
In the number 3028073, what is the 7th digit?
Which of tooth, blue or monkey is a colour?
Which word is all in capitals: tailwinds, either, ABDICATE or procurators?
Which word from the list “nimbler, fagging, recapture” contains the letter “t”?
and so on.
While this is much more convenient for visually impaired users and does increase accessibility significantly, the aforementioned contextual parsing would not be difficult to integrate and makes textCAPTCHA inherently weaker than other schemes, which is something the authors openly admit.
An application that solves these, called TextCaptchaBreaker, exists; it was written as a proof of concept in Python back in 2010 and its source code is freely available. We will look at a couple of code samples. The author claims a 99% success rate at the time it was written and tested.
The application supports most major patterns, and it applies the same basic principles for all of them. It tokenizes the sentence into separate words, looks for certain keyword(s) and if it finds a match, it initiates a procedure that is based on the assumption of what a question is by its keyword(s) in order to extract the answer (e.g. for the keyword of “name”, it will look for proper nouns, or capitalized words in the sentence and submit that while omitting articles and other irrelevant words). This works due to the fairly limited scope in textCAPTCHA patterns.
BodyPartsPattern.py:
'''
Created on Nov 11, 2010
@author: Sajid Anwar
'''
import re
parts = ['hair', 'ankle', 'thumb', 'toe', 'eye', 'chin', 'head', 'chest', 'face', 'stomach',
         'hand', 'heart', 'arm', 'ear', 'nose', 'foot', 'knee', 'leg', 'elbow', 'finger', 'tongue',
         'tooth', 'brain']

def solve(question):
    tokens = re.sub(r'[^\w\d\s]', '', question).lower().split(' ')
    found = []

    # Body parts:
    #    'Cat, apple, finger, elephant or hospital: the body part is?' - finger
    #    'The list chin, cat, head, toe, T-shirt and hair contains how many body parts?' - four
    #    'Ant, snake and eye: how many body parts in the list?' - one
    if 'body' in tokens and ('part' in tokens or 'parts' in tokens):
        for i in range(len(tokens)):
            # Go through each token and save those that are body parts.
            if tokens[i] in parts:
                found.append(tokens[i])
    else:
        return None

    if len(found) == 1:
        if 'parts' in tokens:
            return '1'
        else:
            return found[0]
    elif len(found) > 1:
        return str(len(found))
    else:
        return None
The script begins by storing a list of body parts in an array and then defines a procedure which tokenizes the question into an easily parsable format through a regex and built-in string formatting. It then iterates through each token with a for loop and appends those tokens that are body parts to the array. Finally, judging by the length of the array, it returns either the single matched body part (or '1' if a count was asked for), or, if the array length exceeds 1, the array's length converted to a string.
WhichNumberPattern.py:
'''
Created on Nov 11, 2010
@author: Sajid Anwar
'''
import re
single_numbers = {
    'zero': 0, 'ten': 10,
    'one': 1, 'eleven': 11,
    'two': 2, 'twelve': 12,
    'three': 3, 'thirteen': 13,
    'four': 4, 'fourteen': 14,
    'five': 5, 'fifteen': 15,
    'six': 6, 'sixteen': 16,
    'seven': 7, 'seventeen': 17,
    'eight': 8, 'eighteen': 18,
    'nine': 9, 'nineteen': 19,
    'hundred': 100
}

split_numbers = {
    'twenty': 20, 'thirty': 30, 'forty': 40,
    'fifty': 50, 'sixty': 60, 'seventy': 70,
    'eighty': 80, 'ninety': 90
}

ordinals = {
    '1st': 1, 'first': 1,
    '2nd': 2, 'second': 2,
    '3rd': 3, 'third': 3,
    '4th': 4, 'fourth': 4,
    '5th': 5, 'fifth': 5,
    '6th': 6, 'sixth': 6,
    '7th': 7, 'seventh': 7,
    '8th': 8, 'eighth': 8,
    '9th': 9, 'ninth': 9,
    '10th': 10, 'tenth': 10
}

def solve(question):
    tokens = remove_empty(re.sub(r'[^\w\d\s]', ' # ', question).lower().split(' '))
    found = []
    which = -1

    # Which number:
    #    'What is the 2nd number in the list nineteen, 23 and twenty nine?' - 23
    if 'number' in tokens or \
       'largest' in tokens or 'biggest' in tokens or 'highest' in tokens or \
       'smallest' in tokens or 'lowest' in tokens:
        for i in range(len(tokens)):
            # Go through each token and save those that are numbers and those that are ordinal.
            if tokens[i] in ordinals:
                which = ordinals[tokens[i]]
            elif is_num(tokens[i]):
                found.append(int(tokens[i]))
            elif tokens[i] in single_numbers:
                if single_numbers[tokens[i]] == 100 and i != 0 and tokens[i - 1] in single_numbers:
                    found.append(single_numbers[tokens[i - 1]] * 100)
                else:
                    found.append(single_numbers[tokens[i]])
            elif tokens[i] in split_numbers:
                if i + 1 != len(tokens) and tokens[i + 1] in single_numbers:
                    found.append(split_numbers[tokens[i]] + single_numbers[tokens[i + 1]])
                    tokens[i + 1] = None
                else:
                    found.append(split_numbers[tokens[i]])
        if which == -1:
            if 'largest' in tokens or 'biggest' in tokens or 'highest' in tokens:
                return str(max(found))
            elif 'smallest' in tokens or 'lowest' in tokens:
                return str(min(found))
            else:
                return None
        else:
            return str(found[which - 1])
    else:
        return None

def is_num(str_value):
    try:
        int(str_value)
    except (ValueError, TypeError):  # consumed tokens are set to None above
        return False
    return True

def remove_empty(lst):
    while '' in lst:
        lst.remove('')
    return lst
The script begins by defining three separate lists of numbers as key-value pairs (dictionaries, or associative arrays), mapping their alphabetical value to their numerical value. It also defines a control variable, 'which'. It then defines a procedure where the sentence is tokenized (as well as piped through a function that strips blank array entries), and then checks for keywords in the tokenized sentence. It then iterates through the tokens with a for loop and saves all numbers and ordinals, appending them to the array and changing the control variable. If the control variable is unchanged, it looks for quantitative adjectives and returns a corresponding value from the array, depending on whether the adjective implies a maximum or a minimum value.
Another tool that TextCaptchaBreaker pipes its output to, and that is mentioned by the textCAPTCHA developers, is Wolfram Alpha, a computational answer engine that can parse direct human questions and return a compiled, definitive answer. Such advances illustrate the triviality of defeating such CAPTCHA schemes, but these schemes are nevertheless often recommended for the sake of UX.
Simple arithmetic questions were once used, but have now largely fallen out of favor because of how easy it is to grep and evaluate such statements. However, they are significantly more effective if the question is asked in a non-English language.
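As a quick illustration of that weakness, a sketch that greps out and evaluates a simple arithmetic challenge; the question format is an assumption:

import re

question = "What is 3 + 4?"

# Grep the operands and operator straight out of the sentence and evaluate.
match = re.search(r"(\d+)\s*([+*-])\s*(\d+)", question)
if match:
    a, op, b = int(match.group(1)), match.group(2), int(match.group(3))
    print {"+": a + b, "-": a - b, "*": a * b}[op]  # prints 7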
3.3. reCAPTCHA
reCAPTCHA is a proprietary CAPTCHA solution, and arguably the most famous and widely deployed one. It was originally developed at Carnegie Mellon University and later acquired by Google in 2009.
The project acts as both an anti-spam solution and an OCR research project, as all of its words are taken from archives of old books and newspapers, the content of which cannot be deciphered by OCR technology and is instead fed to users through a callback via a JavaScript API, independently verified across several users and sent back to the main servers as solved.
The system is pretty resistant to spam due to its constant and frequent modifications. It has also changed its generation engine throughout the years, employing different distortion patterns (such as mixing black and white colors to achieve an ink blot effect across the text) and as of 2012, it also uses house numbers from Google Street View.
Nevertheless, it has been criticized as being far too obfuscated to the point of being unreadable and cryptic. This has indeed been the case, both in past and present variants and has been the subject of much criticism, although Google does also offer an audio stream to go with the textual CAPTCHA. The application also allows for new CAPTCHAs to be loaded asynchronously in real time.
What's notable is that the way reCAPTCHA is designed means it does not necessarily require an entirely correct input, but a relatively correct one. The reason for this is that a reCAPTCHA string consists of two words: a control word that is already known by OCR engines, and one that is not. So long as the control word is properly filled in, one can cheat on the other. 4chan exploited this technique by mass-submitting the word “penis” for the unknown word, which also resulted in an inadvertent poisoning of the CAPTCHA pool.
Perhaps the most successful attack on reCAPTCHA involved researchers from the group DC949, who in 2012 attacked the weakest link of the system: the audio reCAPTCHA. At the time it had a much more limited vocabulary (or “keyspace” in crypto terms) and the audio quips were limited to 8 seconds in length, making them easily recognizable through speech recognition. The first run had a 99.1% success rate. While Google adjusted the system just hours before their talk was scheduled, they were still able to get success rates of around 60%. They also exploited the aforementioned control word technique, but in an even more creative way: they realized that by making so-called “word merges”, they could match multiple words simultaneously. For instance, “Waon” matched “One”, “Van” and “Wagon”. These effects have since been mitigated, however.
3.4. Solve Media CAPTCHA
Solve Media is a proprietary CAPTCHA solution particularly notable for (quite ironically) focusing just as much on monetizing CAPTCHAs as on preventing spam. Their challenges usually involve typing in advertising phrases, selecting choices from a drop-down list or even forcing the viewer to watch streaming advertisements, with the challenge printed on screen midway through their duration.
They're not commonly deployed, but can be seen on premium file hosting services. It is debatable whether their service could be considered spam itself, whether some of its modes can be recognized by OCR, and whether the application is vulnerable to client-side exploits.
The easiest method that spammers use to get past these is plain dedicated human labor and CAPTCHA solving services, many of which can be discovered on the basic Surface Web.
3.5. KeyCAPTCHA
KeyCAPTCHA is another proprietary and infrequently seen solution that makes the user solve a challenge in the form of an interactive jigsaw-esque puzzle: they are given a partially disassembled image and made to add the missing pieces in their proper places, with a small representation of the fully assembled image given in the top right corner for visual reference. Another type of puzzle is to sort images into certain visual categories. These images are often distorted.
This type of CAPTCHA is often seen as particularly arduous and time-consuming by users, which is why it remains unpopular and has the potential to drive away users. Theoretically one could use an API that supports high-level mouse interaction methods and devise a brute-force search algorithm to try various orders until the proper one is found. Splinter and other acceptance testing frameworks do offer one, although whether it is suited to interact with a KeyCAPTCHA cannot be stated for certain. Client-side exploits are another matter, but much more realistically, a spammer could use CAPTCHA solving services and human labor to get around these. Xrumer claims to be able to get past some of them, but these claims have not been tested.
It's worth noting that the Joomla! Wiki lists KeyCAPTCHA as a vulnerable extension under the vulnerability class of ID (information disclosure), which is defined as “account information or session information publicly viewable, or passed to third party without knowledge.”
4. Spamtrap/honeypot email addresses
A spamtrap is an email address that is set aside specifically to lure spambots. This is usually accomplished by including it in markup, but styling it in such a way (or commenting it out) that it cannot be seen by a user from their browser.
Often these spamtraps are merely used as a decoy to distract spambots away from legitimate content on page, but the messages sent to this address are used by some researchers to analyze contemporary spam techniques and add them to heuristic databases.
There are several downsides to spamtraps. First of all, dedicated spammers maintain large blacklists of unwanted addresses, creating an arms race. Once a spamtrap has been uncovered, some spammers may systematically bombard its inbox with messages to poison the sample of what is considered unsolicited spam mail, since spamtraps process all incoming messages as spam by default. Spamtraps can also be abused to forge From: headers and cause unwanted backscatter (an abundance of bounce/failed delivery messages).
A majority of spamtraps are also configured to treat entire email addresses as spam sources, which could lead to innocent users ending up blacklisted as spammers if they somehow choose to write to a spamtrap, or if they reply to or forward a message that has been sent to one.
5. Hidden input form or checkbox
This is a common defense suggested as an alternative to CAPTCHA on forums. It revolves around the theory that a spambot will indiscriminately fill out all forms, including ones with the input type set to “hidden”, thus giving away that it is a bot. The other method involves a checkbox that either requires being checked to prove that the registration is done by a human, or is checked by default and requires being unchecked.
The issue with this is that any bot that uses a library or its own procedures to enumerate HTML inputs before filling out forms can simply ignore all inputs that have the type="hidden" attribute set. Parsers like BeautifulSoup or lxml enable this quite reasonably. The checkbox, on the other hand, can also be enumerated, and the bot can then initiate a procedure that uses basic text analysis and tokenizing techniques, as seen in TextCaptchaBreaker, to determine whether or not the checkbox must be ticked, using a browser automation library or a playback parameter.
This process gets more difficult if CSS is used to hide forms. However, if the webmaster achieves this by using the style attribute, then the bot can simply look for strings such as “visibility:hidden” and “display:none”. If it is done centrally through the stylesheet, it becomes more difficult, although high-level CSS parsers do exist for the more dedicated spammer. One for Python is called cssutils.
If the form is hidden through JavaScript, then this necessitates the usage of high-level browser automation and abstraction APIs like Splinter or Selenium, although parsing the actual JavaScript is still a difficult task. Most spambots obviously don't do this, and neither do search engine crawlers for performance reasons, but it is possible if designed in an intelligent manner.
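A hedged sketch of such input enumeration using the third-party BeautifulSoup (bs4) parser mentioned above; the sample form is illustrative:

from bs4 import BeautifulSoup

html = """<form>
<input type="text" name="req_username">
<input type="hidden" name="trap1">
<input type="text" name="trap2" style="display:none">
</form>"""

soup = BeautifulSoup(html, "html.parser")
for field in soup.find_all("input"):
    style = field.get("style", "").replace(" ", "")
    if field.get("type") == "hidden":
        continue  # never fill an explicit hidden input
    if "display:none" in style or "visibility:hidden" in style:
        continue  # skip fields hidden by an inline style attribute
    print "Would fill:", field.get("name")  # only req_username survives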
6. DNS blacklisting
A DNS blacklist, also known as a DNS-based Blackhole List (DNSBL), is a list of IP addresses mapped through the Domain Name System, distributed either as zone files for integration into mail transfer agents and DNS server software, or as a live zone that can be queried by outside agents.
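To illustrate the live zone query, a minimal sketch; Spamhaus ZEN is used as a well-known public zone, and 127.0.0.2 is the conventional test entry that DNSBLs list by design:

import socket

def is_listed(ip, zone="zen.spamhaus.org"):
    # Reverse the octets and prepend them to the zone, per DNSBL convention.
    query = ".".join(reversed(ip.split("."))) + "." + zone
    try:
        socket.gethostbyname(query)
        return True    # any A record in the answer means the address is listed
    except socket.gaierror:
        return False   # NXDOMAIN means it is not

print is_listed("127.0.0.2")  # the standard test entry; should print True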
Another class of DNSBL is the URI DNSBL, which collects domain names and IP addresses that are known to have been registered by spammers and spamdexers for their express purposes, and are not normally seen outside of their unsolicited mail. There is also a subclass of the URI DNSBL which specifically checks for domain names in the “From:” or “Reply-To:” headers. These tend to be less effective as the headers are often forged, as noted previously.
DNSBLs are widely available and used, with some having noted histories and standards. Nonetheless, as with any system based on blacklist principles, they have the potential to be less effective or even damaging. Legitimate users can end up blacklisted simply through the misfortune of being on a shared mail server used by a spammer. Static addresses may also end up listed by error. Some lists have questionable criteria for inclusion and exclusion, e.g. listing hosts that have not actually sent spam but merely opened a connection, or requiring payment for delisting. One can also falsely end up on a blacklist if its data is taken from a spamtrap that was abused by a spammer.
Therefore, as with any other tool, they are to be used with caution. Activists like John Gilmore have compared the effects of DNSBLs to those of monopolies. Worth noting is that spammers really fucking hate them, too, though. The largest distributed denial-of-service attack in history was against Spamhaus in March of 2013, peaking at 300 Gbit/s.
7. List poisoning
List poisoning is an old but sneaky anti-spam technique, especially useful when webmasters host large mailing lists. The technique involves mixing (“poisoning”) the lists with fake email addresses, in the hope of throttling the spammer's resources or making his automated tools start to encounter errors. Fake addresses can also be randomly scattered across pages just to troll harvesters, too.
The effectiveness of this depends on how the invalid addresses are generated. If they are gibberish but syntactically valid, then a basic harvester will chomp them up and send without further analysis. Yet if they do not obey RFC standards, or are merely munged in some way, then spammers can use regexes and text-processing patterns to either reverse the munging before storing the addresses, as demonstrated above under address munging, or discard them entirely. A sufficiently dedicated spammer could also devise a function that judges an address's information entropy and discards it if it proves too high (as it would be for gibberish), as sketched below.
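A rough sketch of such an entropy check; the threshold here is an arbitrary illustration and would need tuning against real address corpora:

import math
from collections import Counter

def shannon_entropy(s):
    # Character-level Shannon entropy, in bits per character
    total = float(len(s))
    return -sum((n / total) * math.log(n / total, 2)
                for n in Counter(s).values())

def looks_poisoned(address, threshold=3.5):
    # Random gibberish spreads its characters far more evenly than
    # real names do, so its entropy score comes out higher
    return shannon_entropy(address.split("@")[0]) > threshold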
In addition, the fake addresses could, however rarely, turn out to be legitimate simply due to birthday-paradox probability, if the address generation algorithm is not random enough, or simply as time goes on. What's more, even if the local part of an address is nonexistent, a valid domain name means the target domain's MX servers still get exhausted by the spammer's sending machines.
One of the most common scripts implementing the list poisoning technique is Wpoison, written in Perl.
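Wpoison itself is Perl, but the gist fits in a few lines of Python. A sketch (the word list and domain are placeholders), with local parts built from real words so they survive the entropy check described above:

import random

WORDS = ["mail", "info", "sales", "admin", "contact", "support"]

def fake_address(domain="example.com"):
    # Plausible-looking but nonexistent local part
    return "%s%d@%s" % (random.choice(WORDS),
                        random.randint(1, 9999), domain)

# Scatter a batch of decoy mailto links across a page
decoys = ["<a href=\"mailto:%s\">%s</a>" % (a, a)
          for a in [fake_address() for _ in range(50)]]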
8. Flood control
This is a basic quality-of-service (QoS) measure that involves limiting the rate of messages that can be sent in a given timeframe, whether they be forum posts, emails, IRC messages, etc.
Additionally, mail servers can be set to throttle after a given number of messages have been sent, in order to aggravate spammers and force them to quit. Some mail servers are also deliberately configured to process incoming messages slowly. These are known as tarpits (see also section 9.3).
This solution is least effective, almost worthless, against forum spambots, as they can easily be programmed to post in a predetermined pattern that assumes flood control is on by default, e.g. time.sleep(20) before every post. However, it does stop and/or confuse poorly coded and rudimentary bots, usually playback ones.
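For reference, the server-side half of flood control is equally simple. A minimal per-sender sketch:

import time

last_post = {}        # sender (user or IP) -> time of last accepted post
MIN_INTERVAL = 20     # seconds; matches the sleep a smart bot would use

def allow_message(sender):
    # Reject anything arriving faster than one message per interval
    now = time.time()
    if now - last_post.get(sender, 0) < MIN_INTERVAL:
        return False
    last_post[sender] = now
    return True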
9. Request fingerprinting and analysis
Request fingerprinting involves analyzing and filtering out not the spam itself, but the actual traffic generated by spammers, and in particular by their automated tools. Due to the often poor and predictable nature of spamware, it usually leaves specific patterns in its headers by which it can be filtered out, preventing spam before it is even sent.
9.1. HTTP fingerprinting
HTTP fingerprinting involves performing analysis on the HTTP requests sent out by spamware, looking for any curiosities, malformations or deviations from standards that can be used to infer that they were sent by an automated tool rather than a human user with a web browser. The principles in many ways resemble those of a web application firewall (WAF).
The first and most obvious way is user agent filtering. As you may have noticed, neither of our example bots spoofs its user agent. Our email spambot uses the default Python-urllib agent when harvesting addresses. Our forum spambot uses the default user agent of the Firefox browser session, which, while identical to that of a genuine user, can still be spoofed as an additional measure.
User agent spoofing with the Python standard library:
import urllib2

headers = { "User-Agent": "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/29.0.1547.57 Safari/537.36" }
# Note: urllib2 requires a full URL, including the scheme
req = urllib2.Request("http://www.google.com", None, headers)
response = urllib2.urlopen(req)
In Splinter:
from splinter import Browser

browser = Browser(user_agent="Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/29.0.1547.57 Safari/537.36")
Other methods are to look for response-only headers in a supposed HTTP request, such as Content-Range, which is a dead giveaway for a bot. Bots may also naively mix protocol versions, for instance sending Cache-Control (an HTTP/1.1 header) in requests that they mark as HTTP/1.0. Mangled headers in general that defy RFC standards should be filtered, as should incomplete or unusually empty ones.
The referer can also be probed to see if it is blank (indicative of a bot) or if it uses a relative URL. Although a relative referer is not technically invalid, all known legitimate user agents send absolute URLs. One can also craft a special cookie and look for sudden suspicious changes in the request structure (user agents that switch on the fly, and so on).
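A sketch tying the above checks together; Bad Behavior (mentioned below) implements these and far more, so this is only illustrative, and it assumes header names arrive in canonical form:

RESPONSE_ONLY = set(["content-range", "accept-ranges"])

def smells_like_bot(headers):
    # headers: dict mapping request header names to values
    names = set(k.lower() for k in headers)
    # Response-only headers have no business in a request
    if names & RESPONSE_ONLY:
        return True
    # No user agent at all is an immediate giveaway
    if not headers.get("User-Agent"):
        return True
    # Relative referers are never sent by known legitimate browsers
    referer = headers.get("Referer", "")
    if referer and not referer.lower().startswith(("http://", "https://")):
        return True
    return False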
Another good method is to check the validity of incoming search engine crawlers, which are a common disguise employed by spambots (and privacy-aware users). Since the real crawlers operate from publicly known IPv4 address ranges, checking the client address against those ranges, or performing a reverse DNS lookup followed by a forward confirmation, will do the job; anything that fails can be trashed. A more convoluted solution could involve marking crawler user agents and seeing if they attempt to access the disallowed directories enumerated in the robots.txt file, and blocking them if they do. This could be extended into a general-purpose defense against intruders.
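Google documents exactly this reverse-then-forward check for verifying Googlebot; a sketch:

import socket

def verify_googlebot(ip):
    # Reverse lookup the connecting address...
    try:
        host = socket.gethostbyaddr(ip)[0]
    except socket.herror:
        return False
    if not host.endswith((".googlebot.com", ".google.com")):
        return False
    # ...then forward-confirm that the name maps back to the same IP
    return socket.gethostbyname(host) == ip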
These techniques can certainly be countered, but doing so requires the spammer to be knowledgeable of web protocol standards and to take the time to create a standards-compliant spambot that emulates legitimate browser traffic (or to make use of an actual browser agent included in testing frameworks, as described previously).
A solution that implements all of these tactics is Bad Behavior, which is used by several major websites.
9.2. Callback verification
Callback verification is a technique used by mail transfer agents to determine the legitimacy of sender addresses. It involves essentially the same method as would be used to bounce back a message, only without actually sending any content.
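A sketch of the callback itself with smtplib and dnspython (the HELO hostname is a placeholder, and the bare 250 check is a simplification; see the caveats below):

import smtplib
import dns.resolver   # dnspython, for the MX lookup

def callback_verify(address):
    domain = address.split("@")[1]
    # Pick the highest-priority MX record for the domain
    records = sorted(dns.resolver.query(domain, "MX"),
                     key=lambda r: r.preference)
    server = smtplib.SMTP(str(records[0].exchange))
    server.helo("verifier.example.com")   # placeholder hostname
    server.mail("<>")         # null sender, exactly as a bounce would
    code, _ = server.rcpt(address)
    server.quit()              # hang up without ever sending DATA
    return code == 250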
Due to its primitive nature, it is only effective against forged addresses; it becomes absolutely ineffectual when spammers spoof legitimate ones. Most MTA documentation cautions against its use because of the many inadvertent side effects that can occur. For instance, if the mail server does a callback to a spamtrap, it might end up blacklisted somewhere and poison the spamtrap's efficacy. Spammers may also deliberately use a spoofed genuine address and send mail to a number of MX servers that do callbacks, creating a denial-of-service effect against the source address.
Server (mis)configurations can also be a burden: for instance, if the sender's server does not specify bounce addresses, rejects bounce messages altogether, or uses a catch-all address (an address that accepts all mail addressed to a domain, even if the local part does not exist on the server), the last of which will treat everything as valid.
9.3. Miscellaneous RFC standards compliance
These include checking the HELO or EHLO commands supplied by hosts connecting to a mail server and dropping them if they are invalid as defined in RFC 5321, and checking whether the connection has been properly closed by the host (through SMTP QUIT). Both can be countered by writing standards-compliant spambots, as mentioned previously.
Some mail servers may implement what's known as tarpitting: delaying incoming connections on the assumption that a spammer will give up, since holding on in order to send messages becomes unprofitable. If the spammer utilizes a botnet of zombie computers, this is far less of a deterrent, as the bots open multiple connections and burn foreign resources.
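A toy single-threaded tarpit, just to show the trick (the port and banner are placeholders; a real one would fork or use an event loop):

import socket
import time

listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("0.0.0.0", 2525))   # illustrative port, not 25
listener.listen(5)

while True:
    conn, addr = listener.accept()
    # Dribble the SMTP greeting out one byte at a time; an impatient
    # or poorly written client wastes minutes before it can even talk
    for ch in "220 mail.example.com ESMTP\r\n":
        conn.send(ch)
        time.sleep(5)
    conn.close()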
CONCLUSION
This has been a whitepaper on spam and anti-spam, mainly from the perspective of constructing and thwarting automated spambots and spamware, while also getting into the mindset of a spammer by elaborating on plausible countermeasures to common defenses. This paper by no means encompasses the entire subject of spam, which has practically grown into its own separate field and business, both on the side of those spreading it and of those preventing it.
This was written to fill a gap in the available information about the mechanisms of spamware in particular, and with the hope that it will be useful, educational or entertaining. Feel free to redistribute this paper, so long as its contents are unaltered.
For further reading, I recommend the book Ending Spam: Bayesian Content Filtering and the Art of Statistical Language Classification by Jonathan Zdziarski. If you want something entertaining, then go read Spam Kings: The Real Story Behind the High-Rolling Hucksters Pushing Porn, Pills and @*#?% Enlargements by Brian McWilliams.
REFERENCES
https://en.wikipedia.org/wiki/Spambot
http://www.botmasterlabs.net/
http://mdeering.com/posts/001-fight-spam-with-javascript
http://uxmovement.com/forms/captchas-vs-spambots-why-the-checkbox-captcha-wins/
http://nedbatchelder.com/text/stopbots.html
http://www.campaignmonitor.com/blog/post/3817/stopping-spambots-with-two-simple-captcha-alternatives
http://email.about.com/cs/bayesianfilters/a/bayesian_filter.htm
http://spambayes.sourceforge.net/background.html
http://bad-behavior.ioerror.us/about/
http://www.cs.utoronto.ca/~noam/spambot.html
http://www.ex-parrot.com/pdw/Mail-RFC822-Address.html
http://blog.shubh.am/catpchajacking-a-new-approach-on-bypassing-the-captcha
http://docs.joomla.org/Vulnerable_Extensions_List#KeyCaptcha
http://www.mcafee.com/us/resources/white-papers/foundstone/wp-attacking-captchas-for-fun-profit.pdf
http://www.cs.berkeley.edu/~tygar/papers/Image_Recognition_CAPTCHAs/imagecaptcha.pdf
http://www.caca.zoy.org/wiki/PWNtcha
http://www.cs.columbia.edu/~mmerler/project/Final%20Report.pdf
http://www.boyter.org/decoding-captchas/
https://github.com/kbhomes/TextCaptchaBreaker
https://www.youtube.com/watch?v=rfgGNsPPAfU