Captchas

Captchas – those twisty distorted words you have to decipher and type in to access a website – have been around since the 1990s. Their original purpose was to tell the difference between a human user and an automated system, by requiring the user to answer a challenge – one that was supposedly hard for computers to solve, but easy for humans. A few years later they acquired the name CAPTCHA, an acronym for “Completely Automated Public Turing test to tell Computers and Humans Apart”.

Optical character recognition was pretty inaccurate in the 90s, especially with blurry or misaligned text. Text that was clearly legible to a human was completely inscrutable to state-of-the-art OCR software. The first developers of CAPTCHA took all the advice for getting accurate OCR scans and did the opposite – intentionally creating text that would be readable to people, but impossible for OCR to parse. This worked fairly well to differentiate humans and robots for a while, but eventually technology began to catch up. Off-the-shelf OCR got better, and mechanical attacks specific to commonly used CAPTCHAs got good enough to answer a significant fraction of challenges correctly.
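
To make that concrete, here's a minimal, hypothetical sketch (using Pillow – not anyone's actual CAPTCHA generator) of the kind of rendering that approach implies: jitter each glyph off the baseline, draw lines through the word, and add a little rotation and blur, so the word stays human-readable while breaking the clean-text assumptions 1990s-era OCR relied on.

```python
# Hypothetical sketch of an OCR-hostile word image; requires Pillow.
import random
from PIL import Image, ImageDraw, ImageFilter, ImageFont

def distorted_captcha(word: str) -> Image.Image:
    img = Image.new("L", (60 * len(word), 100), color=255)  # white canvas
    draw = ImageDraw.Draw(img)
    font = ImageFont.load_default()  # a real generator would use a large TTF font
    for i, ch in enumerate(word):
        # jitter each glyph vertically so characters don't share a clean baseline
        draw.text((10 + 60 * i, 40 + random.randint(-12, 12)), ch, font=font, fill=0)
    for _ in range(3):
        # strike-through lines defeat naive character segmentation
        draw.line([(random.randint(0, img.width), random.randint(0, img.height)),
                   (random.randint(0, img.width), random.randint(0, img.height))],
                  fill=0, width=2)
    # slight rotation and blur: easy for a person, hard for era-appropriate OCR
    img = img.rotate(random.uniform(-8, 8), fillcolor=255)
    return img.filter(ImageFilter.GaussianBlur(radius=1))

distorted_captcha("example").save("captcha.png")
```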

Attempting to combat that by making CAPTCHAs harder to solve mechanically also made them much harder for humans to solve, and was terrible for accessibility. That arms race had gone about as far as it could.

Meanwhile, a group at CMU realized that the millions of hours of human time wasted on solving CAPTCHAs could be applied to useful work instead. They had a lot of scanned documents they wanted to digitize, but the quality of the images was too poor to OCR. So they mechanically extracted single words from the documents, showed them two at a time to users as a CAPTCHA, and asked them to enter both words. They knew the right answer for one word, so if the user entered that word correctly they assumed the user had probably read the unknown word correctly too (and, almost as an aside, allowed the user access). That was “reCAPTCHA”.
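
The grading logic behind that scheme is simple enough to sketch. This is a hypothetical illustration of the control-word idea described above, not CMU's or Google's actual code (real reCAPTCHA also required several users to agree before a transcription was trusted):

```python
# Hypothetical sketch: one word with a known answer (the control), one unknown.
def grade_recaptcha(control_answer: str, control_truth: str,
                    unknown_answer: str) -> tuple[bool, str | None]:
    """Return (grant_access, transcription to record for the unknown word)."""
    if control_answer.strip().lower() == control_truth.strip().lower():
        # The user read the known word correctly, so let them in and treat
        # their reading of the OCR-resistant word as probably correct too.
        return True, unknown_answer.strip()
    return False, None
```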

Making reCAPTCHA a hosted service made it much simpler for website owners to use CAPTCHAs, so they began to be used more widely. reCAPTCHA was acquired by Google in 2009, and Google kept developing it. They used it to digitize street numbers for Google Street View, and added “pick matching images” as an alternative puzzle. Increasingly, though, they didn’t actually need the humans to solve puzzles in the common case – they could tell from the history of the connecting IP address and the behaviour of the web browser that a user was likely legitimate, and let them through without making them do anything other than check a box. It was tracking reputation instead.

If you’ve ever used TOR – the secure browser that hides any identifying cookies and mixes your web traffic untraceably in with that of other TOR users – you’ll have seen what the web looks like to users with a poor reputation. Every time you see a reCAPTCHA it’s not just a simple checkbox, it’s answering dozens of visual puzzles. That’s what most bots will see with reCAPTCHA.

reCAPTCHA is no longer really based on separating humans from bots so much as on differentiating between “normal”, “good” users – probably human – and “bad” users – probably bots – based on many characteristics of the user’s session. It’s really pretty good at that, and with the “just check a box” version of reCAPTCHA that most users will see most of the time, it’s pretty low friction. It’s probably the most effective tool against subscription bombing at the moment.

Google have just released version 3 of reCAPTCHA. This isn’t really a CAPTCHA at all, rather it’s a way to track the behaviour and reputation of users as visible to Google. It watches your users as they interact with your site, sends all that data to Google and they provide a quality score for each user that you can combine with your own information about them to make decisions.
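
For site owners the integration is small: the page gets a token from Google's script, and the server exchanges that token for a score via the siteverify endpoint. Here's a minimal sketch of the server side (Python with the requests library); the secret key and the 0.5 threshold mentioned below are placeholders you'd choose for your own site, and the score is meant to be combined with your own signals rather than used alone.

```python
# Minimal sketch of reCAPTCHA v3 server-side verification; requires requests.
import requests

RECAPTCHA_SECRET = "your-secret-key"  # placeholder

def recaptcha_v3_score(token: str, remote_ip: str | None = None) -> float:
    resp = requests.post(
        "https://www.google.com/recaptcha/api/siteverify",
        data={"secret": RECAPTCHA_SECRET, "response": token, "remoteip": remote_ip},
        timeout=5,
    ).json()
    # The response includes "success", "score" (0.0 = likely bot, 1.0 = likely
    # human), "action" and "hostname".
    return resp.get("score", 0.0) if resp.get("success") else 0.0

# e.g. treat anything scoring under 0.5 as suspicious and fall back to another check
```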

It’s all very low friction, and probably very effective at detecting malicious bots. But it’s also a very intrusive user tracking technology that’ll send user history to Google whenever they’re on a site that uses it. The list of information it captures is definitely enough to uniquely fingerprint and track a user. It’ll be interesting to see what happens when that collides with the move towards web browsers being more privacy-focused and hostile to tracking.

Related Posts

Are seed lists still relevant?

Those of you who have seen some of my talks have seen this model of email delivery before. The concept is that there are a host of factors that contribute to the reputation of a particular email, but that at many ISPs the email reputation is only one factor in email delivery. Recipient preferences drive whether an email ends up in the bulk folder or the inbox.

The individual recipient preferences can be explicit or implicit. Users who add a sender to their address book, or block a sender, or create a specific filter for an email are stating an explicit preference. Additionally, ISPs monitor some user behavior to determine how wanted an email is. A recipient who moves an email from the bulk folder to the inbox is stating a preference. A person who hits “this-is-spam” is stating a preference. Other actions are also measured to give a user specific reputation for a mail.
Seed accounts aren’t like normal accounts. They never send mail; they only download it. They never dig anything out of the junk folder, and they never hit “this is spam”. They are different from real user accounts – and ISPs can track this.
This tells us we have to take inbox monitoring tools with a grain of salt. I believe, though, they’re still valuable tools in the deliverability arsenal. The best use of these tools is monitoring for changes. If seed lists show less than 100% inbox, but response rates are good, then it’s unlikely the seed boxes are correctly reporting delivery to actual recipients. But if seed lists show 100% inbox and then change and go down, then that’s the time to start looking harder at the overall program.
The other time seed lists are useful is when troubleshooting delivery. It’s nice to be able to see if changes are making a difference in delivery. Again, the results aren’t 100% accurate but they are the best we have right now.
 


September 2015: The month in email

September’s big adventure was our trip to Stockholm, where I gave the keynote address at the APSIS Conference (look for a wrapup post with beautiful photos of palaces soon!) and had lots of interesting conversations about all things email-related.
Now that we’re back, we’re working with clients as they prepare for the holiday mailing season. We wrote a post on why it’s so important to make sure you’ve optimized your deliverability strategy and resolved any open issues well in advance of your sends. Steve covered some similar territory in his post “Outrunning the Bear”. If you haven’t started planning, start now. If you need some help, give us a call.
In that post, we talked a bit about the increased volumes of both marketing and transactional email during the holiday season, and I did a followup post this week about how transactional email is defined — or not — both by practice and by law. I also wrote a bit about reputation and once again emphasized that sending mail people actually want is really the only strategy that can work in the long term.
While we were gone, I got a lot of spam, including a depressing amount of what I call “legitimate spam” — not just porn and pharmaceuticals, but legitimate companies with appalling address acquisition and sending strategies. I also wrote about spamtraps again (bookmark this post if you need more information on spamtraps, as I linked to several previous discussions we’ve had on the subject) and how we need to start viewing them as symptoms of larger list problems, not something that, once eradicated, means a list is healthy. I also posted about Jan Schaumann’s survey on internet operations, and how this relates to the larger discussions we’ve had on the power of systems administrators to manage mail (see Meri’s excellent post here).
I wrote about privacy and tracking online and how it’s shifted over the past two decades. With marketers collecting and tracking more and more data, including personally-identifiable information (PII), the risks of organizational doxxing are significant. More than ever before, marketers need to be aware of security issues. On the topic of security and cybercrime, Steve posted about two-factor authentication, and how companies might consider providing incentives for customers to adopt this model.


Outrunning the Bear

You’ve started to notice that your campaigns aren’t working as well as they used to. Your metrics suggest fewer people are clicking through, perhaps because more of your mail is ending up in junk folders. Maybe your outbound queues are bigger than they used to be.
You’ve not changed anything – you’re doing what’s worked well for years – and it’s not like you’ve suddenly had an influx of spamming customers (or, if you have, you’ve dealt with them much the same as you have in the past).
So what changed?
Everything else did. The email ecosystem is in a perpetual state of change.
There’s not a bright line that says “email must be this good to be delivered”.
Instead, most email filtering practice is based on trying to identify mail that users want, or don’t want, and delivering based on that. There’s some easy stuff – mail that can be easily identified as unwanted (malware, phishing, botnet spew) and mail that can easily be identified as wanted (SPF/DKIM authenticated mail from senders with clean content and a consistent history of sending mail that customers interact with and never mark as spam).
The hard bit is the greyer mail in the middle. Quite a lot of it may be wanted, but not easily identified as wanted mail. And a lot of it isn’t wanted, but not easily identified as spam. That’s where postmasters, filter vendors and reputation providers spend a lot of their effort on mitigation, monitoring recipient response to that mail and adapting their mail filtering to improve it.
Postmasters, and other filter operators, don’t really care about your political views or the products you’re trying to sell, nor do they make moral judgements about your legal content (some of the earliest adopters of best practices have been in the gambling and pornography space…). What they care about is making their recipients happy, making the best predictions they can about each incoming mail based on the information they have. And one of the most efficient ways to do that is to look at the grey area, see what mail is at the back of the pack – the least wanted – and focus on blocking “mail like that”.
If you’re sending mail in that grey area – and as an ESP you probably are – you want to stay near the front or at least the middle of the grey area mailers, and definitely out of that “least wanted” back of the pack. Even if your mail isn’t great, competitors who are sending worse mail than you will probably feel more filtering pain and feel it sooner.
Some of those competitors are updating their practices for 2015, buying in to authentication, responding rapidly to complaints and feedback loop data, and preemptively terminating spammy customers – and by doing so they’re both sending mail that recipients want and making it easy for ISPs (and their postmasters and their machine learning systems) to recognize that they’re doing that.
Other competitors aren’t following this year’s best practices, have been lazy about providing customer-specific authentication, are letting new customers send spam with little oversight, and aren’t monitoring feedback and delivery to make sure they’re sending a good mail stream. They end up in the spam folder, their good customers migrate elsewhere because of “delivery issues”, and bad actors move to them because they have a reputation for “not being picky about acquisition practices”. They risk spiraling into wholesale bulk foldering and becoming just a “bulletproof spam-friendly ESP”.
If you’re not improving your practices you’re probably being passed by your competitors who are, and you risk falling behind to the back of the pack.
And your competitors don’t need to outrun the bear, they just need to outrun you.
