Entries tagged with “scanning” from Tools of Change for Publishing
Magazines Now in Google Book Search
Google is adding back issues of magazines to its Book Search index. From the Official Google Blog:
Try queries like [obama keynote convention], [hollywood brat pack] or [world's most challenging crossword] and you'll find magazine articles alongside books results. Magazine articles are tagged with the keyword "Magazine" on the search snippet.
Over time, as we scan more articles, you'll see more and more magazines appear in Google Book Search results. Eventually, we'll also begin blending magazine results into our main Google.com search results, so you may begin finding magazines you didn't even know you were looking for. For now you can restrict your search to magazines we've scanned by trying an advanced search.
The Associated Press says Google will share advertising revenue generated by Google ads with magazine publishers. Embedded advertising from the original print editions remains intact as part of the overall archive. It'll be interesting to see how Google and magazine publishers coordinate on ads if/when publishers seed current editions into the service.
In recent months, Google also released a similar newspaper archive through Google News and a large collection of photos from LIFE magazine.
EFF Attorney: Google Book Search Settlement Weakens Innovation
In an editorial in The Recorder, Fred von Lohmann of the Electronic Frontier Foundation says Google's settlement with publishers and authors signals that Google's legal team has implicitly abandoned its work on behalf of innovation across Silicon Valley:
.. By settling rather than taking the case all the way ... Google has solved its own copyright problem -- but not anyone else's. Without a legal precedent about the copyright status of book scanning, future innovators are left to defend their own copyright lawsuits. In essence, Google has left its former copyright adversaries to maul any competitors that want to follow its lead.
Google will doubtless be considering the same endgame for the Viacom lawsuit against YouTube. If Google can strike a settlement with a large slice of the aggrieved copyright owners, then it solves the copyright problem for itself, while leaving it as a barrier to entry for YouTube's competitors.
But when innovators like Google cut individual deals, it weakens the Silicon Valley innovation ecology for everyone, because it leaves the smaller companies to carry on the fight against well-endowed opponents. Those kinds of cases threaten to yield bad legal precedents that tilt the rules against disruptive innovation generally.
Slides from "What Publishers Need to Know about Digitization" Webcast
TOC will be posting a complete recording of the presentation, but in the meantime I've posted the slides from yesterday's webcast, "What Publishers Need to Know about Digitization," on SlideShare.
Thanks to everyone who attended and especially to those who asked so many excellent questions.
Harvard Won't Permit Google Scans of In-Copyright Material
Harvard University Library (HUL) has been a partner in Google's library scanning project since 2004, but the boundaries of that partnership will not expand to the in-copyright works covered under Google's new Book Search settlement. From the Harvard Crimson:
In a letter released to library staff, University Library Director Robert C. Darnton '60 said that uncertainties in the settlement made it impossible for HUL to participate.
"As we understand it, the settlement contains too many potential limitations on access to and use of the books by members of the higher education community and by patrons of public libraries," Darnton wrote.
"The settlement provides no assurance that the prices charged for access will be reasonable," Darnton added, "especially since the subscription services will have no real competitors [and] the scope of access to the digitized books is in various ways both limited and uncertain."
The Crimson notes that Harvard will continue to allow scanning of books with expired copyrights.
The Analog Hole: Another Argument Against DRM
Digital rights management (DRM) might be unpopular with the public and plagued with social and technical challenges, but at least it's a guarantee that digital books can't be pirated — right?
Not so fast. Experienced computer crackers will find weaknesses in any encryption scheme, but regular folks with basic computer skills can exploit the one weakness found in all DRM'ed media: the analog hole.
What is the Analog Hole?
The "analog hole" reflects a basic principle of physics: before humans can consume any digital media, the ones and zeroes that computers understand must be converted into an analog format that our senses can perceive. For music, it's sound waves; for video and for digital books, it's patterns of light.
If you've ever visited a major city, you've probably seen the analog hole in action: street vendors selling pirated copies of popular movies, often months before they're officially released on DVD. Most of these are "cam" films, shot in real movie theaters using camcorders. Even without access to a physical copy of the film, pirates are able to capture its analog expression: the sound and pictures as perceived by a theater-goer.
In music, the analog hole is often used to get around software preventing digital copying. A user simply plays the desired song on their computer using the legal DRM-enabled software, and records the audio coming out of their computer. Now they have a copy of the sound recording, which can be re-imported into the computer and digitally encoded, with the original DRM stripped out. (A similar principle is at work when DRM systems go defunct and users are told to pirate their own music, although the industry uses the euphemism "making a backup.")
Film and music companies are painfully aware of the analog hole and have taken steps to close it, either by monitoring patron behavior (as in movie theaters) or by petitioning to legally limit the recording features of consumer electronics.
Because reading is a visual experience, ebooks are open to an analog hole exploit as well. Unlike camcorder copies or re-burned MP3s, an ebook copied this way can potentially be reproduced with no loss in quality. And with a little ingenuity, the process can be completely automatic.
One example: Ebooks and Optical Character Recognition (OCR)
Here's a sample digital book as displayed in Adobe Digital Editions. (This book is public domain and isn't technically covered by DRM, but the principle is exactly the same.)
I hid as much of the Digital Editions menus as I could and took a screenshot of this first page of Pride and Prejudice.
Next I downloaded some free optical character recognition (OCR) software. OCR programs can "read" images and output the words in them as plain text. It's a normal part of digitization projects, in which archival printed material is first scanned and its text is automatically extracted. At the consumer level, OCR software is often bundled with commercial scanners and fax machines.
I took my screenshot and fed it to the OCR software. Here's what I got without any special fine-tuning or spell-checking. Note that all typos are from the OCR software.
Chapter 1
It is a truth universally acknowledged, that a single man in possession ofa large fortune must be in want of a wife, However little known the feelings or views of such a man may be on his first entering a neighbourhood, this truth is so well fixed in the minds of the surrounding families, that he is considered the rightful property of someone or other of their daughters.
"My dear Mr. Bennet," said his lady to him one day, "have you heard that Netherfield Park is let at last?"
Mr. Bennet replied that he had not.
...and on through the entire first page. This output was in HTML, ready to be posted to the Web for anyone to read.
The OCR isn't 100 percent accurate, of course, but neither are the widely available pirated ebooks created by laborious scanning of physical books, page after page. I was also using free software that requires careful fine-tuning to get working optimally; commercial OCR software is much better, especially when combined with spell-checking.
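To put a rough number on "isn't 100 percent accurate," OCR output can be compared against a known-good copy of the same passage. The following is a minimal sketch using Python's standard difflib module, not the tooling used for this post. The function names are hypothetical, and REFERENCE is an assumption: a hand-corrected version of the OCR sample above, not text checked against a particular edition.

```python
from difflib import SequenceMatcher


def ocr_accuracy(reference, ocr_output):
    """Return a character-level similarity ratio between 0.0 and 1.0."""
    return SequenceMatcher(None, reference, ocr_output, autojunk=False).ratio()


def ocr_differences(reference, ocr_output):
    """Return (expected, got) pairs for each run of words that differs."""
    ref_words = reference.split()
    ocr_words = ocr_output.split()
    matcher = SequenceMatcher(None, ref_words, ocr_words, autojunk=False)
    diffs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            diffs.append((" ".join(ref_words[i1:i2]), " ".join(ocr_words[j1:j2])))
    return diffs


# Assumed reference: the OCR sample with its two visible typos corrected by hand.
REFERENCE = (
    "It is a truth universally acknowledged, that a single man in "
    "possession of a large fortune must be in want of a wife. However little known"
)
# The same words as they came out of the OCR software.
OCR_OUTPUT = (
    "It is a truth universally acknowledged, that a single man in "
    "possession ofa large fortune must be in want of a wife, However little known"
)

if __name__ == "__main__":
    print(f"similarity: {ocr_accuracy(REFERENCE, OCR_OUTPUT):.3f}")
    for expected, got in ocr_differences(REFERENCE, OCR_OUTPUT):
        print(f"expected {expected!r}, got {got!r}")
```

A word-level diff like this also shows where a spell-checking pass would help: a fused word such as "ofa" is easy to flag, while a swapped punctuation mark is not.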
It wouldn't be difficult to automate the process of advancing one page in Digital Editions, taking a screenshot, and passing that on to my OCR software. Once the workflow was in place, I could strip hundreds or thousands of books of their DRM in a matter of minutes.
Another Possibility: Speech Recognition
My local library is kind enough to allow me to check out digital audiobooks. Naturally they're also secured with DRM (so much so that I can't actually play them, as they require Windows Media Player and I have only Mac and Linux computers). But assuming I could play them, I'd have available to me a nice stream of professionally-produced audio.
You're using speech recognition software every time you call a customer service line and an automated voice prompts you to speak your credit card number. If that's happened to you, you also know that speech recognition isn't 100 percent accurate yet, but under certain conditions it can be quite good. Automatic speech-to-text transcription isn't nearly as far along as optical character recognition, but it's another analog hole exploit that will eventually become trivial to perform.
Does This Mean Publishers Shouldn't Produce Ebooks or Audiobooks?
No! What I hope to convey is that DRM is not a true safeguard against ebook piracy. (It is, however, a known deterrent to ebook adoption.) I've heard a lot of passing the buck on DRM: publishers claim authors want it, booksellers claim publishers insist on it. These days it's hard to find someone to publicly state that they're actually for it.
I think of DRM like this: years ago my apartment was broken into and I called a locksmith to replace the door. My landlord had authorized me to get "the best lock possible," and the locksmith obliged with a four-foot steel bolt. It was almost too heavy to turn but made a very satisfying noise when it snapped shut.
I asked the locksmith, "Is this really unbreakable?"
"The lock is, sure." He slapped the door frame. "But this is made out of wood. If I really wanted to get in I'd just kick out the door. That's why I'm honest about what I sell." When I looked puzzled he handed me his business card. It contained his name, phone number, and company slogan: "A feeling of security."
Authors and publishers should be compensated for their talent and their hard work, and the desire for DRM is understandable. Book lovers, too, want their favorite authors to succeed. But digital books are a form of technology as much as they are literature, and technologies that are successful adapt to people's needs. Is just a "feeling" of security worth the ire of good customers who want to read their books wherever and however they like?
TOC Recommended Reading
In Defense of Piracy (Lawrence Lessig, Wall Street Journal)
The return of this "remix" culture could drive extraordinary economic growth, if encouraged, and properly balanced. It could return our culture to a practice that has marked every culture in human history -- save a few in the developed world for much of the 20th century -- where many create as well as consume. And it could inspire a deeper, much more meaningful practice of learning for a generation that has no time to read a book, but spends scores of hours each week listening, or watching or creating, "media."
Where is everybody? (Joe Wikert, TeleRead)
"If you build it, they will come" only works in the movies. If they really want to succeed Borders needs to do something beyond just making all this technology available in the store. Where are the in-store events (e.g., come let us help you research your family name, come see the latest e-book technologies, etc.)? How about signage in other areas of the store that promotes the tech kiosk area?
Mass book digitization: The deeper story of Google Books and the Open Content Alliance (Kalev Leetaru, First Monday)
Both projects offer the ability to search within a particular work, but only Google offers the ability to search across its entire collection. A search across the OCA archive only searches titles and description fields, not the full text of works. The OCA system thus offers a document-centric model, while Google offers both document and collection-based models, allowing broad exploratory searches of its entire holdings: the equivalent of being able to "full text search" a library. The importance of this difference cannot be understated in the limitations it places on the ability of patrons to interact with the OCA collections.
Google Scanning Newspaper Archives
Google is extending its scanning efforts to newspaper archives. From the New York Times:
Under the expanded program, Google will shoulder the cost of digitizing newspaper archives, much as the company does with its book-scanning project. Google angered some book publishers because it had failed to seek permission to scan books that were protected by copyrights. It will obtain permission from newspaper publishers before scanning their archives.
Google ... will place advertisements alongside search results, and share the revenue from those ads with newspaper publishers.
The Times says some archived results are currently available through Google News and newspapers will eventually be able to offer archival searches through their own sites.
Google Taking Long View on Book Search
With Microsoft abandoning Live Search Books, eWEEK turns the spotlight on Google Book Search:
... the smart strategy would be for Google to advance its effort from the "not-too-distant future" to the present. Google can pretty much corner the market at this point. Google was asked by eWEEK when it could expect to see some Book Search results, but the spokesperson declined to comment.
(Note: by "results," I'm assuming the article's author is talking about financial returns.)
Despite the short-term opportunity presented by Microsoft's departure, an analyst quoted in the eWEEK piece believes Google is taking a long-view approach to Book Search.
(Via Publishers Weekly.)
Long-Term Questions Around Google and Content
Martyn Daniels raises long-term questions about Google's copying of content from publishers' books:
Publishers have in many cases argued it is healthy to give them [Google] content as they drive up sales, others that they are stealing it. Whatever your viewpoint the question that must be answered is what do they intend to do with it tomorrow? Will they always us it as they do today? Can they re assign it to others, either in part or whole? Can the copyright owner revert rights, given or taken, if the copyright ownership of the original work changes? Can the originator object? History is littered with cases where the result was not what people expected to happen at the beginning and where market dominance created a new venture not previously envisaged.
Publishing is a rights business yet we often seem to struggle managing them and the older the content the murkier rights become. Today is the right time to revisit the question of Google's Book programme and not continue to go blindly forward as if nothing has changed.
A Google-Amazon Mobile Application?
Android Scan, one of the winners from the Google Android Developer Challenge, uses cell phone cameras and barcode recognition to tap into Amazon's review database. From Silicon Valley Insider:
Scan barcodes on any book or CD when you’re in a store and your phone will pull up Amazon reviews and check local library listings to see if the book is in stock.
Why it's cool: Google’s been pushing mobile barcode scanning, so they might dig this app, too. We assume the developers have included their Amazon referral code in the app so they get a 5%+ commission on any purchases you make, too.
A Glimpse into Google's Book Scanning
Google doesn't divulge specifics about its proprietary book scanning set-up, but the Associated Press offers a brief look into the manual scanning process used for old/fragile titles:
... the temperature is always in the 60s ... Each technician has a slightly angled table with a flexible middle that cradles books and holds them still while two overhead cameras photograph the pages. ... Once the images reach the computer, the women [featured in the AP story] use the book scanning software Omniscan from Germany's Zeutschel GmbH to clean them up. A final click of the mouse sends each digitized book to Google for optical character recognition processing, which makes the text searchable. Google then returns a copy of the images and data to the library and posts another to the Web.
(Via Publishers Weekly)