Entries tagged with “ocr” from Tools of Change for Publishing

Slides from "What Publishers Need to Know about Digitization" Webcast

TOC will be posting a complete recording of the presentation, but in the meantime I've posted the slides from yesterday's webcast, "What Publishers Need to Know about Digitization," on SlideShare.

Thanks to everyone who attended and especially to those who asked so many excellent questions.

The Analog Hole: Another Argument Against DRM

Digital rights management (DRM) might be unpopular with the public and plagued with social and technical challenges, but at least it's a guarantee that digital books can't be pirated — right?

Not so fast. Experienced computer crackers will find weaknesses in any encryption scheme, but regular folks with basic computer skills can exploit the one weakness found in all DRM'ed media: the analog hole.

What is the Analog Hole?

The "analog hole" reflects a basic principle of physics: before humans can consume any digital media, the ones and zeroes that computers understand must be converted into an analog format that our senses can perceive. For music, it's sound waves; for video and for digital books, it's patterns of light.

If you've ever visited a major city, you've probably seen the analog hole in action: street vendors selling pirated copies of popular movies, often months before they're officially released on DVD. Most of these are "cam" films, shot in real movie theaters using camcorders. Even without access to a physical copy of the film, pirates are able to capture its analog expression: the sound and pictures as perceived by a theater-goer.

In music, the analog hole is often used to get around software that prevents digital copying. A user simply plays the desired song on their computer using the legal DRM-enabled software and records the audio coming out of the computer. Now they have a copy of the sound recording, which can be re-imported into the computer and digitally encoded, with the original DRM stripped out. (A similar principle is at work when DRM systems go defunct and users are told to pirate their own music, although the industry uses the euphemism "making a backup.")

Film and music companies are painfully aware of the analog hole and have taken steps to close it, either by monitoring patron behavior (as in movie theaters) or by petitioning to legally limit the recording features of consumer electronics.

Because reading is a visual experience, ebooks are open to the same analog hole exploit. Unlike camcorder copies or re-burned MP3s, an ebook copy can potentially be made with no loss in quality. And with a little ingenuity, the process can be completely automatic.

One example: Ebooks and Optical Character Recognition (OCR)

Here's a sample digital book as displayed in Adobe Digital Editions. (This book is public domain and isn't technically covered by DRM, but the principle is exactly the same.)

[Screenshot: pride-chapter-one.png]

I hid as much of the Digital Editions menus as I could and took a screenshot of this first page of Pride and Prejudice.

Next I downloaded some free optical character recognition (OCR) software. OCR programs can "read" images and output the words in them as plain text. OCR is a normal part of digitization projects, in which archival printed material is first scanned and its text is then automatically extracted. At the consumer level, OCR software is often bundled with commercial scanners and fax machines.

I took my screenshot and fed it to the OCR software. Here's what I got without any special fine-tuning or spell-checking. Note that all typos are from the OCR software.

Chapter 1
It is a truth universally acknowledged, that a single man in possession ofa large fortune must be in want of a wife, However little known the feelings or views of such a man may be on his first entering a neighbourhood, this truth is so well fixed in the minds of the surrounding families, that he is considered the rightful property of someone or other of their daughters.
"My dear Mr. Bennet," said his lady to him one day, "have you heard that Netherfield Park is let at last?"
Mr. Bennet replied that he had not.

...and on through the entire first page. This output was in HTML, ready to be posted to the Web for anyone to read.

The OCR isn't 100 percent accurate, of course, but neither are the widely available pirated ebooks created by laboriously scanning physical books, page after page. I was also using free software that requires careful fine-tuning to work optimally; commercial OCR software is much better, especially when combined with spell-checking.
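To put a rough number on "isn't 100 percent accurate," here is a minimal Python sketch that compares a fragment of the OCR output quoted earlier against a hand-corrected version of the same fragment. It uses only the standard library's difflib module. The function names `similarity` and `differences` are made up for this illustration, and the corrected string is my own fix of the two visible typos, not text taken from the source ebook.

```python
from difflib import SequenceMatcher


def similarity(ocr_text, reference):
    """Return a 0.0-1.0 similarity ratio between OCR output and a reference text."""
    return SequenceMatcher(None, ocr_text, reference, autojunk=False).ratio()


def differences(ocr_text, reference):
    """Return (ocr_fragment, reference_fragment) pairs where the texts disagree."""
    matcher = SequenceMatcher(None, ocr_text, reference, autojunk=False)
    return [
        (ocr_text[i1:i2], reference[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


# Fragment of the OCR output quoted earlier, typos included.
ocr_output = "possession ofa large fortune must be in want of a wife, However"
# Hand-corrected version of the same fragment (an assumption, not the source ebook).
corrected = "possession of a large fortune must be in want of a wife. However"

print(f"similarity: {similarity(ocr_output, corrected):.3f}")
for found, expected in differences(ocr_output, corrected):
    print(f"OCR gave {found!r}, expected {expected!r}")
```

The ratio difflib reports is a general similarity measure (twice the number of matching characters divided by the combined length of both strings), not a formal character error rate, so treat it as a quick sanity check rather than a benchmark. The list of differences is the more useful part: it shows exactly which fragments a spell-checking pass would need to repair.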

It wouldn't be difficult to automate the process of advancing one page in Digital Editions, taking a screenshot, and passing that on to my OCR software. Once the workflow was in place, I could strip hundreds or thousands of books of their DRM in a matter of minutes.

Another Possibility: Speech Recognition

My local library is kind enough to allow me to check out digital audiobooks. Naturally they're also secured with DRM (so much so that I can't actually play them, as they require Windows Media Player and I have only Mac and Linux computers). But assuming I could play them, I'd have available to me a nice stream of professionally produced audio.

You're using speech recognition software every time you call a customer service line and an automated voice prompts you to speak your credit card number. If that's happened to you, you also know that speech recognition isn't 100 percent accurate yet, but under certain conditions it can be quite good. Automatic speech-to-text transcription isn't nearly as far along as optical character recognition, but it's another analog hole exploit that will eventually become trivial to perform.

Does This Mean Publishers Shouldn't Produce Ebooks or Audiobooks?

No! What I hope to convey is that DRM is not a true safeguard against ebook piracy. (It is, however, a known deterrent to ebook adoption.) I've heard a lot of passing the buck on DRM: publishers claim authors want it, booksellers claim publishers insist on it. These days it's hard to find someone to publicly state that they're actually for it.

I think of DRM like this: years ago my apartment was broken into and I called a locksmith to replace the door. My landlord had authorized me to get "the best lock possible," and the locksmith obliged with a four-foot steel bolt. It was almost too heavy to turn but made a very satisfying noise when it snapped shut.

I asked the locksmith, "Is this really unbreakable?"

"The lock is, sure." He slapped the door frame. "But this is made out of wood. If I really wanted to get in I'd just kick out the door. That's why I'm honest about what I sell." When I looked puzzled he handed me his business card. It contained his name, phone number, and company slogan: "A feeling of security."

Authors and publishers should be compensated for their talent and their hard work, and the desire for DRM is understandable. Book lovers, too, want their favorite authors to succeed. But digital books are a form of technology as much as they are literature, and technologies that are successful adapt to people's needs. Is just a "feeling" of security worth the ire of good customers who want to read their books wherever and however they like?
