Thursday, February 4, 2010

Copyright-Safe Full-Text Indexing of Books

As the February 18 hearing on the revised Google Books Settlement Agreement draws near, I think it's timely to explore some issues surrounding full-text indexing of books. It's important to realize that when Google began its program of scanning books in libraries, it chose to do so in a way that entered the gray zone of fair use. Google continues to maintain that its scanning activities are perfectly legal, and fair use advocates welcomed the Publishers' and Authors' lawsuit because it had the potential to clarify ambiguities around fair use. No matter where the court decided to draw the line, both fair use and rightsholder control would be able to extend into the zone of current uncertainty.

Overlooked in the controversy is the fact that Google could have chosen a safer course in its effort to make full-text indices of books. In this article, I'll argue that it's possible to make full-text indices of books in a way that steers well clear of copyright infringement. But first, I should note that playing it safe would not have been a good plan for Google. By pushing fair use to its limits, Google assured itself a favorable competitive position. In a lawsuit, Google could have lost on 90% of the fair use they were claiming and would still have ended up 10% ahead of where a safe course would have taken them. Google is large enough that even a 10% victory in court would have paid off in the long run. As it is, Google chose to settle the lawsuit under terms that put them in a better position than they would have occupied by playing it safe, and potential competitors don't gain the benefits of a fair-use precedent.

I make two assumptions about copyright in devising a copyright-safe indexing method:
  1. You can't infringe the copyright to a work if you don't copy the work.
  2. If you can't reconstruct a work from its index, then distributing copies of the index doesn't infringe on the work's copyright.
Just in case these assumptions are weak, my fall-back position is that indexing is clearly a fair use under US copyright law.

First, the fall-back assumption: full-text indexing is allowed as fair use under US copyright law. Indices are allowed as "transformative uses". Judge Robert Patterson's decision (pdf, 195K) in the "Harry Potter Lexicon" case gives an excellent background of this jurisprudence and concludes:
The purpose of the Lexicon’s use of the Harry Potter series is transformative. Presumably, Rowling created the Harry Potter series for the expressive purpose of telling an entertaining and thought provoking story centered on the character Harry Potter and set in a magical world. The Lexicon, on the other hand, uses material from the series for the practical purpose of making information about the intricate world of Harry Potter readily accessible to readers in a reference guide. To fulfill this function, the Lexicon identifies more than 2,400 elements from the Harry Potter world, extracts and synthesizes fictional facts related to each element from all seven novels, and presents that information in a format that allows readers to access it quickly as they make their way through the series. Because it serves these reference purposes, rather than the entertainment or aesthetic purposes of the original works, the Lexicon’s use is transformative and does not supplant the objects of the Harry Potter works.
The author of the Lexicon lost his case not because his indexing was not allowed, but rather because he copied too much of J. K. Rowling's creative expression in doing so.

Second, you have to copy to infringe copyright. A more accurate statement is this: You have to either make a copy or a derivative work to infringe copyright. The second piece of this can be a bit more confusing, because "derivative work" has a specific meaning in copyright law. A translation into another language is an example of a derivative work. Indices are not derivative works. The law considers indices to be more akin to metadata. I might need access to a book to count the number of figures it contains, but a report of the number of figures in a book and what page they're on is in no way a derivative work. The copyright act defines a derivative work as
a work based upon one or more preexisting works, such as a translation, musical arrangement, dramatization, fictionalization, motion picture version, sound recording, art reproduction, abridgment, condensation, or any other form in which a work may be recast, transformed, or adapted.
If you make copies by scanning, however, as Google is doing, you must also establish that your use is allowed as fair use. If you don't make copies, then you never even need to reach the fair use provision.

The last assumption gets more technical. The simplest form of a word index is a sorted list of words with pointers to the occurrence of the word within the text. So an index of the preceding sentence might look like this:
a    5,9
form    3
index    7
is    8
list    11
occurrence    18
of    4,12,19
pointers    15
simplest    2
sorted    10
text    24
the    1,17,20,23
to    16
with    14
within    22
word    6,21
words    13
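For readers who want to try this, here is a minimal Python sketch of the kind of positional index shown above. The function names build_positional_index and reconstruct are my own illustrative choices, and the sketch assumes words are lowercased and split on whitespace. It builds the word-to-position listing and then rebuilds the word sequence purely from that listing.

```python
from collections import defaultdict

def build_positional_index(sentence):
    """Map each lowercased word to the 1-based positions where it occurs."""
    index = defaultdict(list)
    words = sentence.lower().rstrip(".").split()
    for position, word in enumerate(words, start=1):
        index[word].append(position)
    # Sort by word so the result reads like the listing above.
    return dict(sorted(index.items()))

def reconstruct(index):
    """Rebuild the word sequence by putting every word back at its positions."""
    placed = {pos: word for word, positions in index.items() for pos in positions}
    return " ".join(placed[pos] for pos in sorted(placed))

sentence = ("The simplest form of a word index is a sorted list of words "
            "with pointers to the occurrence of the word within the text.")
index = build_positional_index(sentence)
for word, positions in index.items():
    print(word, ",".join(map(str, positions)))
print(reconstruct(index))
```

The final print shows the original sentence again (lowercased, without the period), recovered from nothing but the index.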
It doesn't take a computer science degree to see that it's easy to reconstruct the sentence from this index. For that reason this form of index is equivalent to a copy. If you remove the position pointers, however, the index loses enough information that the sentence cannot be reconstructed. So if we take the words on a page of text and sort the words in each sentence, then sort the word-sorted sentences, we get an index of a page that can't be used to reconstruct text, but can be used to build a useful full-text index of a book.
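The position-free page index described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: sentences are split naively on ".", "!" and "?", words are lowercased, and the names page_index and build_book_index are hypothetical. It shows the procedure, not a proof that no text can ever be recovered from the result.

```python
import re

def page_index(page_text):
    """Return a position-free index of one page.

    Words inside each sentence are sorted, then the sorted sentences are
    themselves sorted, so neither word order nor sentence order survives.
    """
    # Naive sentence split (an assumption; real text needs more care).
    sentences = re.split(r"[.!?]+", page_text)
    sorted_sentences = []
    for sentence in sentences:
        words = sorted(re.findall(r"[a-z0-9']+", sentence.lower()))
        if words:
            sorted_sentences.append(tuple(words))
    return sorted(sorted_sentences)

def build_book_index(pages):
    """Map each word to the sorted list of page numbers on which it appears."""
    book_index = {}
    for page_number, text in pages.items():
        for sentence_words in page_index(text):
            for word in sentence_words:
                book_index.setdefault(word, set()).add(page_number)
    return {word: sorted(nums) for word, nums in sorted(book_index.items())}

# Made-up sample pages, purely for illustration.
pages = {
    1: "The cat sat on the mat. A dog barked.",
    2: "The dog sat down.",
}
print(page_index(pages[1]))
print(build_book_index(pages)["sat"])
```

Because each sentence is kept as a group of words, a search for words that occur in the same sentence still works, even though the order of the words is gone.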

The trickiest step of completely copyright-safe indexing is producing the page index from a book without producing intermediate copies of the pages. In a conventional scanning process, a digital image of a page is stored to disk and the copy is passed to OCR software. Indexing software then works on the OCR text. A scanning process that was fastidious about copyright, however, could scan lines of text word by word and never acquire an image large enough to be subject to copyright.
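As a rough illustration of the word-by-word idea, the Python sketch below consumes one recognized word at a time and files it straight into a sorted list, so the original word order of a sentence is never stored. The fake_word_scanner generator is a hypothetical stand-in for scanning hardware and OCR, which this sketch does not model, and whether such a process would satisfy a court is a legal question the code cannot answer.

```python
from bisect import insort

SENTENCE_END = (".", "!", "?")

def stream_page_index(word_stream):
    """Consume recognized words one at a time and yield sorted sentences.

    Only the current word and the already-sorted words of the current
    sentence are held; the original word order is never stored.
    """
    current = []  # kept in sorted order at all times via insort
    for token in word_stream:
        ends_sentence = token.endswith(SENTENCE_END)
        word = token.strip(".,;:!?\"'()").lower()
        if word:
            insort(current, word)
        if ends_sentence and current:
            yield tuple(current)
            current = []
    if current:
        yield tuple(current)

def fake_word_scanner():
    """Hypothetical stand-in for a scanner that emits one OCR'd word at a time."""
    yield from "A dog barked. The cat sat on the mat.".split()

print(sorted(stream_page_index(fake_word_scanner())))
```

The result has the same shape as a page index built from stored text, but it is assembled without ever holding a whole line or page.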

US courts have considered the loading of a copyrightable work into a computer's RAM storage to constitute copying, but scanning sufficient to produce an index can in principle be done without requiring that to occur. (For an excellent law review article on the RAM-copying situation, read Jonathan Band and Jeny Marcinko's article in Stanford Technology Law Review.) Also, even sentences of more than a few words can be considered copyrightable works, as I discussed in an article from November.

Another possible way to avoid copying is to build a black-box indexer. A closer look at the RAM-copying precedent, MAI Systems v. Peak Computer, suggests that a non-copying scanning indexer can be built even if page images exist somewhere in RAM. In that case, the court reasoned that the software copy could be viewed via terminal readouts, system logs, and that sort of thing. If a closed-box indexing system were built so that page images resident in RAM could never be "perceived, reproduced, or otherwise communicated", then there is a fair chance that a court would find that copying was not occurring.

I'm a technologist, not a lawyer. I would welcome comment and criticism from experts of all stripes on this analysis. For example, I've not considered international aspects at all. There are many technical aspects of copyright-safe indexing that would need to be sorted out, but doing so could open the way to countless transformative uses of all the books in the world.

1 comment:

  1. You know, it's reading this kind of careful, detailed analysis -- especially the bit about scanning only one line at a time because having more than that in memory at once might be considered "copying" -- that leads an increasing number of people to very simple conclusions such as "copyright is dumb". It's getting harder and harder to disagree with them.
