Captchas

Captchas – those twisty distorted words you have to decipher and type in to access a website – have been around since the 1990s. Their original purpose was to tell the difference between a human user and an automated system, by requiring the user to answer a challenge – one that was supposedly hard for computers to solve, but easy for humans. A few years later they acquired the name CAPTCHA, an acronym for “Completely Automated Public Turing test to tell Computers and Humans Apart”.

Optical character recognition was pretty inaccurate in the 90s, especially with blurry or misaligned text. Text that was clearly legible to a human was completely inscrutable to state-of-the-art OCR software. The first developers of CAPTCHA took all the advice for getting accurate OCR scans, and did the opposite – intentionally creating text that would be readable, but impossible for OCR to parse. This worked fairly well to differentiate humans and robots for a while, but eventually technology began to catch up. Off-the-shelf OCR got better, and mechanical attacks specific to commonly used CAPTCHAs were getting good enough to answer a significant fraction correctly.

Attempting to combat that by making CAPTCHAs harder to solve mechanically also made them much harder for humans to solve, and was terrible for accessibility. That arms race had gone about as far as it could.

Meanwhile, a group at CMU realized that the millions of hours of human time wasted by solving CAPTCHAs could be applied to do useful work instead. They had a lot of scanned documents they wanted to digitize, but the quality of the images was too poor to OCR. So they mechanically extracted single words from the documents and showed them, two at a time, to users as a CAPTCHA and asked them to enter the two words. They knew the right answer for one word, so if the user entered that word right they’d assume the user probably got the unknown word right too (and, almost as an aside, allowed the user access). That was “reCAPTCHA”.
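The two-word check is easy to sketch in code. This is only an illustration of the idea described above, not the real reCAPTCHA implementation: the function names, the in-memory vote store and the "three matching answers" rule for accepting a transcription are all assumptions made for the example.

```python
from collections import Counter, defaultdict

# Hypothetical in-memory store: for each unknown word image, count the
# transcriptions typed by users who got the control word right.
votes = defaultdict(Counter)


def normalize(text):
    """Ignore case and surrounding whitespace when comparing answers."""
    return text.strip().lower()


def check_challenge(control_answer, unknown_id, typed_control, typed_unknown):
    """Grant access only if the known (control) word was typed correctly.

    A correct control word is treated as evidence that the user probably
    read the unknown word correctly too, so their answer is recorded.
    """
    if normalize(typed_control) != normalize(control_answer):
        return False
    votes[unknown_id][normalize(typed_unknown)] += 1
    return True


def best_guess(unknown_id, min_votes=3):
    """Return a transcription once enough users agree on it, else None.

    The min_votes threshold is an assumption for this sketch.
    """
    if not votes[unknown_id]:
        return None
    word, count = votes[unknown_id].most_common(1)[0]
    return word if count >= min_votes else None
```

Each user who passes the control word adds one vote for the unknown word, so the digitization work falls out of the access check for free.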

By making reCAPTCHA a hosted service it became much simpler for website owners to use CAPTCHAs, so they began to be used more widely. reCAPTCHA was acquired by Google in 2009 and they kept developing it. They used it to digitize street numbers for Google Street View, and added “pick matching images” as an alternative puzzle. Increasingly, though, they didn’t actually need the humans to solve puzzles in the common case – they could tell from the history of the connecting IP address and the behaviour of the web browser that a user was likely legitimate, and let them through without making them do anything other than check a box. It was tracking reputation instead.

If you’ve ever used Tor – the secure browser that hides any identifying cookies and mixes your web traffic untraceably in with that of other Tor users – you’ll have seen what the web looks like to users with a poor reputation. Every time you see a reCAPTCHA it’s not just a simple checkbox; you have to answer dozens of visual puzzles. That’s what most bots will see with reCAPTCHA.

reCAPTCHA is no longer really based on separating humans from bots so much as it’s differentiating between “normal”, “good” users – probably human – and “bad” users – probably bots – based on many characteristics of the user’s session. It’s really pretty good at that, and with the “just check a box” version of reCAPTCHA that most users will see most of the time it’s pretty low friction. It’s probably the most effective tool against subscription bombing at the moment.

Google have just released version 3 of reCAPTCHA. This isn’t really a CAPTCHA at all, rather it’s a way to track the behaviour and reputation of users as visible to Google. It watches your users as they interact with your site, sends all that data to Google and they provide a quality score for each user that you can combine with your own information about them to make decisions.
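As a rough sketch of what combining that score with your own information might look like for a signup form: the function below assumes the score is a number from 0.0 to 1.0, with higher meaning more likely legitimate, as Google's documentation describes. The function name, the local signals, the adjustments and the thresholds are all hypothetical and would need tuning against your own traffic.

```python
def decide_signup(score, known_customer=False, recent_signups_from_ip=0):
    """Return 'allow', 'confirm' or 'block' for a signup attempt.

    score: the reCAPTCHA v3 score for this request (0.0 to 1.0).
    known_customer, recent_signups_from_ip: hypothetical local signals.
    All adjustments and thresholds here are illustrative assumptions.
    """
    # Trust people we already have a relationship with a little more.
    if known_customer:
        score = min(1.0, score + 0.2)

    # A burst of signups from one address looks like subscription bombing.
    if recent_signups_from_ip > 5:
        score = max(0.0, score - 0.3)

    if score >= 0.7:
        return "allow"
    if score >= 0.3:
        # Middle ground: add friction, e.g. require a confirmation step.
        return "confirm"
    return "block"
```

With these made-up thresholds, a visitor scoring 0.9 is let straight through, while the same score arriving as the tenth signup from one IP address drops into the "confirm" band.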

It’s all very low friction, and probably very effective at detecting malicious bots. But it’s also a very intrusive user tracking technology that’ll send user history to Google whenever they’re on a site that uses it. The list of information it captures is definitely enough to uniquely fingerprint and track a user. It’ll be interesting to see what happens when that collides with the move towards web browsers being more privacy-focused and hostile to tracking.

Related Posts

Open subscription forms going away?

A few weeks ago, I got a call from a potential client. He was all angry and yelling because his ESP had kicked him off for spamming. “Only one person complained!! Do you know him? His name is Name. And I have signup data for him! He opted in! How can they kick me off for one complaint where I have opt-in data? Now they’re talking Spamhaus listings, Spamhaus can’t list me! I have opt-in data and IP addresses and everything.”
We talked briefly but decided that my involvement in this was not beneficial to either party. Not only do I know the complainant personally, I’ve also consulted with the ESP in question specifically to help them sort out their Spamhaus listings. I also know that if you run an open subscription form you are at risk for being a conduit for abuse.
This abuse is generally low level. A person might sign up someone else’s address in an effort to harass them. This is a problem for the victim, but doesn’t often result in any consequences for the sender. Last week’s SBL listings were a response to subscription abuse happening on a large scale.


Outrunning the Bear

You’ve started to notice that your campaigns aren’t working as well as they used to. Your metrics suggest fewer people are clicking through, perhaps because more of your mail is ending up in junk folders. Maybe your outbound queues are bigger than they used to be.
You’ve not changed anything – you’re doing what’s worked well for years – and it’s not like you’ve suddenly had an influx of spamming customers (or, if you have, you’ve dealt with them much the same as you have in the past).
So what changed?
Everything else did. The email ecosystem is in a perpetual state of change.
There’s not a bright line that says “email must be this good to be delivered”.
Instead, most email filtering practice is based on trying to identify mail that users want, or don’t want, and delivering based on that. There’s some easy stuff – mail that can be easily identified as unwanted (malware, phishing, botnet spew) and mail that can easily be identified as wanted (SPF/DKIM authenticated mail from senders with clean content and a consistent history of sending mail that customers interact with and never mark as spam).
The hard bit is the greyer mail in the middle. Quite a lot of it may be wanted, but not easily identified as wanted mail. And a lot of it isn’t wanted, but not easily identified as spam. That’s where postmasters, filter vendors and reputation providers spend a lot of their effort on mitigation, monitoring recipient response to that mail and adapting their mail filtering to improve it.
Postmasters, and other filter operators, don’t really care about your political views or the products you’re trying to sell, nor do they make moral judgements about your legal content (some of the earliest adopters of best practices have been in the gambling and pornography space…). What they care about is making their recipients happy, making the best predictions they can about each incoming mail, based on the information they have. And one of the most efficient ways to do that is to look at the grey area to see what mail is at the back of the pack, the least wanted, and focus on blocking “mail like that”.
If you’re sending mail in that grey area – and as an ESP you probably are – you want to stay near the front or at least the middle of the grey area mailers, and definitely out of that “least wanted” back of the pack. Even if your mail isn’t great, competitors who are sending worse mail than you will probably feel more filtering pain and feel it sooner.
Some of those competitors are updating their practices for 2015, buying in to authentication, responding rapidly to complaints and feedback loop data, and preemptively terminating spammy customers – and by doing so they’re both sending mail that recipients want and making it easy for ISPs (and their postmasters and their machine learning systems) to recognize that they’re doing that.
Other competitors aren’t following this year’s best practices, have been lazy about providing customer-specific authentication, are letting new customers send spam with little oversight, and aren’t monitoring feedback and delivery to make sure they’re a good mail stream. They end up in the spam folder, their good customers migrate elsewhere because of “delivery issues” and bad actors move to them because they have a reputation for “not being picky about acquisition practices”. They risk spiraling into wholesale bulk foldering and becoming just a “bulletproof spam-friendly ESP”.
If you’re not improving your practices you’re probably being passed by your competitors who are, and you risk falling behind to the back of the pack.
And your competitors don’t need to outrun the bear, they just need to outrun you.


Warmup advice for Gmail

Getting to the Gmail inbox in concept is simple: send mail people want to receive. For a well established mail program with warm IPs and domains, getting to the inbox in practice is simple. Gmail uses recipient interaction with email to determine if an email is wanted or not. These interactions are easy when mail is delivered to the inbox, even if the user has tabs enabled.
When mail is in the bulk folder, even if it’s wanted, users are less likely to interact with the mail. Senders trying to change their reputation to get back to the inbox face an uphill battle. This doesn’t mean it’s impossible to get out of the bulk folder at Gmail, it’s absolutely possible. I have many clients who followed my advice and did it. Some of these clients were simply warming up new IPs and domains and needed to establish a reputation. Others were trying to repair a reputation. In both cases, the fixes are similar.

When I asked colleagues how they handled warmup at Gmail their answers were surprisingly similar to one another. They’re also very consistent with what I’ve seen work for clients.
