Monday, February 13, 2017

Miseducation

There's a fascinating video created by the Southern Poverty Law Center (in January 2017) that focuses on Google but is equally relevant to libraries. It is called The Miseducation of Dylann Roof.


In this video, the speaker shows that a Google search on "black on white violence" returns top results that are all from racist sites, and that each of these links only to other racist sites. The speaker claims that Google's algorithms favor sites similar to the ones a user has visited from a Google search, and that eventually, in this case, the user's online searching becomes skewed toward sites that are racist in nature. The claim is that this is what happened to Dylann Roof, the man who killed nine people at an historic African-American church: he entered a closed information system that consisted only of racist sites. The video ends by saying: "It's a fundamental problem that Google must address if it is truly going to be the world's library."

I'm not going to defend or deny the claims of the video, and you should watch it yourself because I'm not giving a full exposition of its premise here (it is short and very interesting). But I do want to question whether Google is or could be "the world's library", and also whether libraries do a sufficient job of presenting users with a well-rounded information space.

It's fairly easy to dismiss the first premise - that Google is or should be seen as a library. Google is operating in a significantly different information ecosystem from libraries. While there is some overlap between Google and library collections, primarily because Google now partners with publishers to index some books, there is much that is on the Internet that is not in libraries, and a significant amount that is in libraries but not available online. Libraries pride themselves on providing quality information, but we can't really take the lion's share of the credit for that; the primary gatekeepers are the publishers from whom we purchase the items in our collections. In terms of content, most libraries are pretty staid, collecting only from mainstream publishers.

I decided to test this out and went looking for works promoting Holocaust denial or Creationism in a non-random group of libraries. I was able to find numerous books about deniers and denial, but only research libraries seem to carry the books by the deniers themselves. None of these come from mainstream publishing houses. I note that the subject heading, Holocaust denial literature, is applied both to items written from the denial point of view and to ones analyzing or debating that view.

Creationism gets a bit more visibility; I was able to find some creationist works in public libraries in the Bible Belt. Again, there is a single subject heading, Creationism, that covers both the pro- and the con-. Finding pro- works in WorldCat is a kind of "needle in a haystack" exercise.

Don't dwell too much on my findings - this is purely anecdotal, although a true study would be fascinating. We know that libraries to some extent reflect their local cultures; witness the Gay and Lesbian Archives at the San Francisco Public Library. But you often hear that libraries "cover all points of view," and that is not really true.

The common statement about libraries is that we gather materials on all sides of an issue, and that users will discover them because they reside near each other on the library shelves. Is this true? Is it adequate? Does it guarantee that library users will encounter a full range of thoughts and facts on an issue?

First, just because the library has more than one book on a topic does not guarantee that a user will choose to engage with multiple sources. There are people who seek out everything they can find on a topic, but as we know from the general statistics on reading habits, many people will not read voraciously on a topic. So the fact that the library has multiple items with different points of view doesn't mean that the user reads all of those points of view.

Second, there can be a big difference between what the library holds and what a user finds on the shelf. Many public libraries have a high rate of circulation of a large part of their collection, and some books have such long holds lists that they may not hit the shelf for months or longer. I have no way to predict what a user would find on the shelf in a library that had an equal number of books expounding the science of evolution vs those promoting the biblical concept of creation, but it is frightening to think that what a person learns will be the result of some random library bookshelf.

But the third point is really the key one: libraries do not cover all points of view, if by points of view you include the kind of misinformation that is described in the SPLC video. There are many points of view that are not available from mainstream publishers, and there are many points of view that are not considered appropriate for anything but serious study. A researcher looking into race relations in the United States today would find the sites that attracted Roof to provide important insights, as SPLC did, but they will not find that same information in a "reading" library.

Libraries have an idea of "appropriate" that they share with the publishing community. We are both scientific and moral gatekeepers, whether we want to admit it or not. Google is an algorithm functioning over an uncontrolled and uncontrollable number of conversations. Although Google pretends that its algorithm is neutral, we know that it is not. On Amazon, which does accept self-published and alternative press books, certain content like pornography is consciously kept away from promotions and best seller lists. Google has "tweaked" its algorithms to remove Holocaust denial literature from view in some European countries that forbid the topic. The video essentially says that Google should make wide-ranging cultural, scientific and moral judgments about the content it indexes.

I am of two minds about the idea of letting Google or Amazon be a gatekeeper. On the one hand, immersing a Dylann Roof in an online racist community is a terrible thing, and we see the result (although the cause and effect may be hard to prove as strongly as the video claims). On the other hand, letting Google and Amazon decide what is and what is not appropriate does not sit well at all. As I've said before, having gatekeepers whose motivations are trade secrets that cannot be discussed is quite dangerous.

There has been a lot of discussion lately about libraries and their supposed neutrality. I am very glad that we can have that discussion. With all of the current hoopla about fake news, Russian hackers, and the use of social media to target and change opinion, we should embrace the fact that we have collection policies, and say openly that we and others have thought carefully about the content of the library. It won't be the most radical content in many cases, but we care about veracity, and that's something that Google cannot say.

Sunday, December 18, 2016

Transparency of judgment

The Guardian, and others, have discovered that when querying Google for "did the Holocaust really happen", the top response is a Holocaust denier site. They mistakenly think that the solution is to lower the ranking of that site.

The real solution, however, is different. It begins with the very concept of the "top site" from searches. What does "top site" really mean? It means something like "the site most often pointed to by other sites that are most often pointed to." It means "popular" -- but by an unexamined measure. Google's algorithm doesn't distinguish fact from fiction, or scientific from nutty, or even academically viable from warm and fuzzy. Fan sites compete with the list of publications of a Nobel prize-winning physicist. Well, except that they probably don't, because it would be odd for the same search terms to pull up both, but nothing in the ranking itself makes that distinction.
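To make that circular definition concrete, here is a minimal sketch, in Python, of the kind of "pointed to by sites that are themselves pointed to" scoring that a PageRank-style ranking performs. The four-site link graph is invented for illustration, and the real algorithm, with its many undisclosed adjustments, is of course Google's trade secret.

    # A toy link graph: each site lists the sites it links to (invented data).
    links = {
        "racist-blog":  ["racist-forum", "racist-wiki"],
        "racist-forum": ["racist-blog", "racist-wiki"],
        "racist-wiki":  ["racist-blog"],
        "newspaper":    ["racist-wiki"],   # a lone outside link into the cluster
    }

    sites = list(links)
    score = {s: 1.0 / len(sites) for s in sites}   # start everyone equal
    damping = 0.85

    # Repeatedly give each site a share of the score of every site linking to it.
    for _ in range(50):
        new = {}
        for s in sites:
            incoming = sum(score[t] / len(links[t]) for t in sites if s in links[t])
            new[s] = (1 - damping) / len(sites) + damping * incoming
        score = new

    # Sites that link only to each other end up inheriting each other's popularity.
    for s, v in sorted(score.items(), key=lambda kv: -kv[1]):
        print(f"{s:13} {v:.3f}")

Run it and the tightly interlinked cluster floats to the top while the site that nothing links to sinks, regardless of what any of the pages actually say.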

The primary problem with Google's result, however, is that it hides the relationships that the algorithm itself uses in the ranking. You get something ranked #1 but you have no idea how Google arrived at that ranking; that's a trade secret. By not giving the user any information about what lies behind the ranking of a specific page, Google denies the user the possibility of making an informed judgment about the source. This informed judgment is not only about the inherent quality of the information in the ranked site, but also about its position in the complex social interactions surrounding knowledge creation itself.

This is true not only for Holocaust denial but every single site on the web. It is also true for every document that is on library shelves or servers. It is not sufficient to look at any cultural artifact as an isolated case, because there are no isolated cases. It is all about context, and the threads of history and thought that surround the thoughts presented in the document.

There is an interesting project of the Wikimedia Foundation called "Wikicite." The goal of that project is to make sure that specific facts culled from Wikipedia into the Wikidata project all have citations that support the facts. If you've done any work on Wikipedia you know that all statements of fact in all articles must come from reliable third-party sources. These citations allow one to discover the background for the information in Wikipedia, and to use that to decide for oneself if the information in the article is reliable, and also to know what points of view are represented. A map of the data that leads to a web site's ranking on Google would serve a similar function.
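As a thought experiment, here is a minimal sketch of what one entry in such a map might look like: a claim that carries its supporting citations with it, so that whatever displays the claim (an article, an infobox, a ranked search result) can also show the reader where it comes from. The record structure and field names are hypothetical, not Wikidata's actual data model.

    # A hypothetical claim record that travels with its citations.
    claim = {
        "statement": "The Holocaust is one of the most thoroughly documented events of the 20th century.",
        "citations": [
            {"source": "Encyclopedia of the Holocaust", "kind": "reference work"},
            {"source": "Nuremberg trial transcripts", "kind": "primary source"},
        ],
    }

    def show_provenance(claim):
        """Print the statement together with the sources that back it."""
        print(claim["statement"])
        for c in claim["citations"]:
            print(f'  supported by: {c["source"]} ({c["kind"]})')

    show_provenance(claim)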

Another interesting project is CITO, the Citation Typing Ontology. This is aimed at scholarly works, and it is a vocabulary that would allow authors to do more than just cite a work - they could give a more specific meaning to the citation, such as "disputes", "extends", "gives support to". A citation index could then categorize citations so that you could see who are the deniers of the deniers as well as the supporters, rather than just counting citations. This brings us a small step, but a step, closer to a knowledge map.
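A rough sketch of what a citation index could do with such typed links: group the citing works by the relationship they assert instead of simply counting them. The three relation names are the CITO terms mentioned above; the works and the citations themselves are invented for illustration.

    from collections import defaultdict

    # Invented typed citations: (citing work, relation, cited work).
    citations = [
        ("Smith 2014", "cito:disputes",       "Denier 1998"),
        ("Jones 2016", "cito:disputes",       "Denier 1998"),
        ("Doe 2015",   "cito:givesSupportTo", "Denier 1998"),
        ("Lee 2017",   "cito:extends",        "Smith 2014"),
    ]

    by_relation = defaultdict(list)
    for citing, relation, cited in citations:
        by_relation[(cited, relation)].append(citing)

    # Now a reader can see who disputes a work and who supports it,
    # rather than just how many times it has been cited.
    for (cited, relation), citing_works in sorted(by_relation.items()):
        print(f'{cited} <- {relation}: {", ".join(citing_works)}')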

All judgments of importance or even relative position of information sources must be transparent. Anything else denies the value of careful thinking about our world. Google counts pages and pretends not to be passing judgment on information, but they operate under a false flag of neutrality that protects their bottom line. The rest of us need to do better.



Tuesday, December 13, 2016

All the (good) books

I have just re-read Ray Bradbury's Fahrenheit 451, and it was better than I had remembered. It holds up very well for a book first published in 1953. I was reading it as an example of book worship, as part of my investigation into examples of an irrational love of books. What became clear, however, is that this book does not describe an indiscriminate love, not at all.

I took note of the authors and the individual books that are actually mentioned in Fahrenheit 451. Here they are (hopefully a complete list):

Authors: Dante, Swift, Marcus Aurelius, Shakespeare, Plato, Milton, Sophocles, Thomas Hardy, Ortega y Gasset, Schweitzer, Einstein, Darwin, Gandhi, Gautama Buddha, Confucius, Thomas Love Peacock, Thomas Jefferson, Lincoln, Tom Paine, Machiavelli, Christ, Bertrand Russell.
Books: Little Black Sambo, Uncle Tom's Cabin, the Bible, Walden
I suspect that by the criteria with which Bradbury chose his authors, he himself, merely an author of popular science fiction, would not have made his own list. Of the books, the first two were used to illustrate books that offended.
"Don't step on the toes of the dog lovers, the cat lovers, doctors, lawyers, merchants, chiefs, Mormons, Baptists, Unitarians, second-generation Chinese, Swedes, Italians, Germans, Texans, Brooklynites, Irishmen, people from Oregon or Mexico."
...
"Colored people don't like Little Black Sambo. Burn it. White people don't feel good about Uncle Tom's Cabin. Burn it. Someone's written a book on tobacco and cancers of the lungs? The cigarette people are weeping? Burn the book. Serenity, Montag."
The other two were examples of books that were being preserved.

Bradbury was a bit of a social curmudgeon, and in terms of books decidedly a traditionalist. He decried the dumbing down of American culture, with digests of books (perhaps prompted by Reader's Digest Condensed Books, which began in 1950), then "digest-digests, digest-digest-digests," then with books being reduced to one or two sentences, and television keeping people occupied but without any perceptible content. (Although he pre-invents a number of recognizable modern technologies, such as earbuds, he fails to anticipate the popularity of George R. R. Martin and other writers of brick-sized tomes.)

Fahrenheit 451 is not a worship of books, but of their role in preserving a certain culture. The "book-people" who each had memorized a book or a chapter hoped to see those become the canon once the new "dark ages" had ended. This was not a preservation of all books but of a small selection of books. That is, of course, exactly what happened in the original dark ages, although the potential corpus then was much smaller: only those texts that had been carefully copied and preserved, and in small numbers, were available for distribution once printing technology became available. Those manuscripts were converted to printed texts, and the light came back on in Europe, albeit with some dark corners un-illuminated where texts had been lost.

Another interesting author on the topic of preservation, but less well-known, is Louis-Sébastien Mercier, writing in 1772 in his utopian novel of the future, Memoirs of the Year Two Thousand Five Hundred.* In his book he visits the King's Library in that year to find that there is only a small cabinet holding the entire book collection. He asks the librarians whether some great fire had destroyed the books, but they answer instead that it was a conscious selection.
"Nothing leads the mind farther astray than bad books; for the first notions being adopted without attention, the second become precipitate conclusion; and men thus go on from prejudice to prejudice, and from error to error. What remained for us to do, but to rebuild the structure of human knowledge?" (v. 2, p. 5)
The selection criteria eliminated commentaries ("works of envy or ignorance") but kept original works of discovery or philosophy. These people also saw a virtue in abridging works to save the time of the reader. Not all works that we would consider "classics" were retained:
"In the second division, appropriated to the Latin authors, I found Virgil, Pliny, and Titus Livy entire; but they had burned Lucretius, except some poetic passages, because his physics they found false, and his morals dangerous." (v. 2, p.9)
In this case, books are selectively burned because they are considered inferior, a waste of the reader's time or tending to lead one in a less than moral direction. Although Mercier doesn't say so, he is implying a problem of information overload.

In Bradbury's book the goal was to empty the minds of the population, to make them passive, not thinking. Mercier's world was gathering all of the best of human knowledge, perhaps even re-writing it, as Paul Otlet proposed. (More on him in a moment.) Mercier's year 2500 world eliminated all the works of commentary on other works, treating them like unimportant rantings on today's social networks. Bradbury also did not mention secondary sources; he names no authors of history (although we don't know whether he thought of Bertrand Russell as a philosopher or also as a historian) or works of literary criticism.

Both Bradbury and Mercier would be considered well-read. But we are all like the blind men and the elephant: we all operate based on the information we have. Bradbury and Mercier each had very different minds because they had been informed by what they had read. For the mind it is "you are what you see and read." Mercier could not have named Thoreau, and Bradbury did not mention any French philosophers. Had they each saved a segment of the written output of history, their choices would have been very different, with little overlap, although they both explicitly retain Shakespeare. Their goals, however, run in parallel, and in both cases the goal is to preserve those works that merit preserving so that they can be read now and in the future.

In another approach to culling the mass of books and other papers, Kurt Vonnegut, in his absurdist manner, addressed the problem as one of information overload:
"In the year Ten Million, according to Koradubian, there would be a tremendous house-cleaning. All records relating to the period between the death of Christ and the year One Million A.D. would be hauled to dumps and burned. This would be done, said Koradubian, because museums and archives would be crowding the living right off the earth. 
The million-year period to which the burned junk related would be summed up in history books in one sentence, according to Koradubian: Following the death of Jesus Christ, there was a period of readjustment that lasted for approximately one million years." (Sirens of Titan, p. 46)
While one hears often about a passion for books, some disciplines rely on other types of publications, such as journal articles and conference papers. The passion for books rarely includes these except occasionally by mistake, such as the bound journals that were scanned by Google in its wholesale digitization of library shelves, and the aficionados of non-books are generally limited to specific forms, such as comic books. In the late 19th and early 20th century, the Belgian Paul Otlet, a fascinating obsessive whose lifetime and interests coincided with those of our own homegrown bibliographic obsessive, Melvil Dewey, began work on what was intended to be a universal bibliography that included books and journal articles, as well as other publications. Otlet's project was aimed at all knowledge, not just that contained in books, and his organization solicited books and journals from European and North American learned societies, especially those operating in scientific areas. As befits a project with the grandiose goal of cataloging all of the world's information, Otlet named it the Mundaneum. Otlet represents another selection criterion, because his Mundaneum appears to have been limited to academic materials and serious works; at the least, there is no mention of fiction or poetry in what I have read on the topic.

Among Otlet's goals was to pull out information buried in books and bring related bits of information together. He called the result of this a Biblion. This Biblion sounds somewhat related to the abridgments and re-gatherings of information that Mercier describes in his book. It also sounds like what motivated the early encyclopedists. To Otlet, the book format was a barrier, since his goal was not the preservation of the volumes themselves, but was to be a centralized knowledge base.

So now we have a range of book preservation goals, from all the books, to all the good books, to the useful information in books. Within the latter two we see that each selection represents a fairly limited viewpoint that would result in the loss of a large number of the books and other materials that are held in research libraries today. For those of us in libraries and archives, the need is to optimize quality without being arbitrary, and at the same time to serve a broad intellectual and creative base. We won't be as perfect as Otlet or as strict as the librarians of Mercier's year 2500, but hopefully our preservation practices will be more predictable than the individual choices made by Bradbury's "human books."



* In the original French, the title referred to the year 2440 ("L'An 2440, rêve s'il en fut jamais"). I have no idea why it was rounded up to 2500 in the English translation.

Works cited or used


Bradbury, Ray. Fahrenheit 451. New York: Ballantine, 1953.

Mercier, Louis-Sébastien. Memoirs of the Year Two Thousand Five Hundred. London: Printed for G. Robinson, 1772. (HathiTrust copy)

Vonnegut, Kurt. The Sirens of Titan. New York: Dial Press Trade Paperbacks, 2006.

Wright, Alex. Cataloging the World: Paul Otlet and the Birth of the Information Age. New York: Oxford University Press, 2014.

Monday, November 28, 2016

All the Books

I just joined the Book of the Month Club. This is a throwback to my childhood, because my parents were members when I was young, and I still have some of the books they received through the club. I joined because my reading habits are narrowing, and I need someone to recommend books to me. And that brings me to "All the Books."

"All the Books" is a writing project I've had on my computer and in notes ever since Google announced that it was digitizing all the books in the world. (It did not do this.) The project was lauded in an article by Kevin Kelly in the New York Times Magazine of May 14, 2006, which he prefaced with:

"What will happen to books? Reader, take heart! Publisher, be very, very afraid. Internet search engines will set them free. A manifesto."

There are a number of things to say about All the Books. First, one would need to define "All" and "Books". (We can probably take "the" as it is.) The Google scanning projects defined this as "all the bound volumes on the shelves of certain libraries, unless they had physical problems that prevented scanning." This of course defines neither "All" nor "Books".

Next, one would need to gather the use cases for this digital corpus. Through the HathiTrust project we know that a small number of scholars are using the digital files for research into language usage over time. Others are using the files to search for specific words or names, discovering new sources of information about possibly obscure topics. As far as I can tell, no one is using these files to read books. The Open Library, on the other hand, is lending digitized books as ebooks for reading. This brings us to the statement that was made by a Questia sales person many years ago, when there were no ebooks and screens were those flickery CRTs: "Our books are for research, not reading." Given that their audience was undergraduate students trying to finish a paper due at 9:30 the next morning, this was an actual use case with actual users. But doing research in texts one does not read is, of course, not ideal from a knowledge acquisition point of view.

My biggest beef with "All the Books" is that it treats them as an undifferentiated mass, as if all the books were equal. I always come back to the fact that if you read one book every week for 60 years (which is a good pace) you will have read 3,120 books. Up that to two books a week and you've covered 6,240 of the estimated 200-300 million books represented in WorldCat. The problem isn't that we don't have enough books to read; the problem is finding the 3,000-6,000 books that will give us the knowledge we need to face life, and be a source of pleasure while we do so. "All the Books" ignores the heights of knowledge, of culture, and of art that can be found in some of those books. Like Sarah Palin's response to the question "Which newspapers form your world view?", "all of them" is inherently an anti-intellectual answer, either from someone who doesn't read any of them, or who isn't able to distinguish the differences.
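For what it's worth, the arithmetic behind those figures, with the collection size taken as a rough midpoint of the estimate above:

    books_per_week = 1
    years = 60
    lifetime_books = books_per_week * 52 * years    # 3,120 at one book a week
    collection = 250_000_000                        # rough midpoint of 200-300 million

    print(lifetime_books)                           # 3120
    print(lifetime_books * 2)                       # 6240 at two books a week
    print(f"{lifetime_books * 2 / collection:.4%}") # about 0.0025% of "All the Books"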

"All the Books" is a complex concept. It includes religious identity; the effect of printing on book dissemination; the loss of Latin as a universal language for scholars; the rise of non-textual media. I hope to hunker down and write this piece, but meanwhile, this is a taste.

Monday, September 26, 2016

2 Mysteries Solved!

One of the disadvantages of a long tradition is that the reasons behind certain practices can be lost over time. This is definitely the case with many practices in libraries, and in particular with practices affecting the library catalog. In U.S. libraries we tend to date our cataloging practices back to Panizzi, in the 1830s, but I suspect that he was already building on practices that preceded him.

A particular problem with this loss of history is that without the information about why a certain practice was chosen it becomes difficult to know if or when you can change the practice. This is compounded in libraries by the existence of entries in our catalogs that were created long before us and by colleagues whom we can no longer consult.


I was recently reading through volume one of the American Library Journal from the year 1876-1877. The American Library Association had been founded in 1876 and had its first meeting in Philadelphia in September, 1876. U.S. librarianship finally had a focal point for professional development. From the initial conference there were a number of ALA committees working on problems of interest to the library community. A Committee on Cooperative Cataloguing, led by Melvil Dewey (who had not yet been able to remove the "u" from "cataloguing"), proposed that cataloging of books be done once, centrally, and shared, at a modest cost, with other libraries that purchased the same book. This was realized in 1902 when the Library of Congress began selling printed card sets. We still have cooperative cataloging, 140 years later, and it has had a profound effect on the ability of American libraries to reduce the cost of catalog creation.

Other practices were set in motion in 1876-1877, and two of these can be found in that inaugural volume. They are also practices whose rationales have not been obvious to me, so I was very glad to solve these mysteries.

Title case

Some time ago I asked on Autocat, out of curiosity, why libraries use sentence case for titles. No one who replied had more than a speculative answer. In 1877, however, Charles Ammi Cutter reports on The Use of Capitals in library cataloging and defines a set of rules that can be followed. His main impetus is "readability;" that "a profusion of capitals confuses rather than assists the eye...." (He also mentions that this is not a problem with the Bodleian library catalog, as that is written in Latin.)

Cutter would have preferred that capitals be confined to proper names, eschewing their use for titles of honor (Rev., Mrs., Earl) and initialisms (A.D.). However, he said that these uses were so common that he didn't expect to see them changed, and so he conceded them.

All in all, I think you will find his rules quite compelling. I haven't looked at how they compare to any such rules in RDA. So much still to do!

Centimeters

I have often pointed out, although it would be obvious to anyone who takes the time to question the practice, that books are measured in centimeters in Anglo-American catalogs, although there are few cultures as insistent on measuring in inches and feet as those. It is particularly unhelpful that books in libraries are cataloged with a height measurement in centimeters while the shelves they are destined for are measured in inches. It is true that the measurement forms part of the description of the book, but at least one use of that description is to determine on which shelves those books can be placed. (Note that in some storage facilities, book shelves are more variable in height than in general library collections, and the size determination allows for more compact storage.) If I were to shout out to you "37 centimeters" you would probably be hard-pressed to reply quickly with the same measurement in inches. So why do we use centimeters?
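Before the history, a quick illustration of the conversion a shelver has to do in their head (there are 2.54 centimeters to the inch); a minimal sketch:

    CM_PER_INCH = 2.54

    def cm_to_inches(cm):
        return cm / CM_PER_INCH

    # The "37 centimeters" shouted across the stacks...
    print(f"{cm_to_inches(37):.1f} in")   # about 14.6 inches

    # ...measured against a shelf described in inches: does it fit under 15 inches?
    print(cm_to_inches(37) <= 15)         # True, but just barely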

The newly formed American Library Association had a Committee on Sizes, which had been given the task of developing a set of standard size designations for books. The "size question" had to do with the then-current practice of listing sizes as folio, quarto, etc. Apparently the rise of modern paper making and printing meant that those were no longer the actual sizes of books. In his article (pp. 56-61), Charles Evans argued that actual measurements of the books, in inches, should replace the previous list of standard sizes. Later, however, the use of inches was questioned. At the ALA meeting, W.F. Poole (of Poole's indexes) made the following statement (p. 109):
"The expression of measure in inches, and vulgar fractions of an inch, has many disadvantages, while the metric decimal system is simple, and doubtless will soon come into general use."
The committee agreed with this approach, and concluded:
"The committee have also reconsidered the expediency of adopting the centimeter as a unit, in accordance with the vote at Philadelphia, querying whether it were really best to substitute this for the familiar inch. They find on investigation that even the opponents of the metric system acknowledge that it is soon to come into general use in this country; that it is already adopted by nearly every other country of importance except England; that it is in itself a unit better adapted to our wants than the inch, which is too large for the measurement of books." (p. 180)

The members of the committee were James L. Whitney, Charles A. Cutter, and Melvil Dewey, the latter having formed the American Metric Bureau in July of 1876, both a kind of lobbying organization and a sales point for metric measures. My guess is that the "investigation" was a chat amongst themselves, and that Dewey was unmovable when it came to using metric measures, although he appears not to have been alone in that. I do love the fact that the inch is "too large," and that its fractions (1/16, etc.) are "vulgar."

Dewey and cohort obviously weren't around when compact discs came on the scene, because those are measured in inches ("1 sound disc : digital ; 4 3/4 in"). However, maps get the metric treatment: "1 map : col. ; 67 x 53 cm folded to 23 x 10 cm". Somewhere there is a record of these decisions, and I hope to come across them.

It would have been ideal if the U.S. had gone metric when Dewey encouraged that move. I suspect that our residual umbilical cord linking us to England is what scuppered it. Yet it is a wonder that we still use those too-large, vulgar measurements. Dewey would be very disappointed to learn this.



So there it is, two of the great mysteries solved in the record of the very first year of the American library profession. Here are the readings; I created separate PDFs for the two most relevant sections:

American Library Journal, volume 1, 1876-1877 (from the Internet Archive)
Cutter, Charles A. The use of capitals. American Library Journal, v.1, n. 4-5, 1877. pp. 162-166
The Committee on Sizes of Books, American Library Journal, v.1, n. 4-5, 1877, pages 178-181

Also note that beginning on page 92 there is a near verbatim account of every meeting at the first American Library Association conference in Philadelphia, September, 1876. So verbatim that it includes the mention of who went out for a smoke and missed a key vote. And the advertisements! Give it a look.

Wednesday, August 31, 2016

User tasks, Step one

Brian C. Vickery, one of the greats of classification theory and a key person in the work of the Classification Research Group (active from 1952 to 1968), gave this list of the stages of "the process of acquiring documentary information" in his 1959 book Classification and Indexing in Science[1]:

  1. Identifying the subject of the search. 
  2. Locating this subject in a guide which refers the searcher to one or more documents. 
  3. Locating the documents. 
  4. Locating the required information in the documents. 

These overlap somewhat with FRBR's user tasks (find, identify, select, obtain), but the first step in Vickery's list is my focus here: identifying the subject of the search. It is a step that I do not see implied in the FRBR "find", and it is all too often missing from library/user interactions today.

A person walks into a library... 

Presumably, libraries are an organized knowledge space. If they weren't, the books would just be thrown onto the nearest shelf, and subject cataloging would not exist. However, if this organization isn't both visible to and comprehended by users, we are, first, not getting the return on our cataloging investment, and, second, users are not getting the full benefit of the library.

In Part V of my series on Catalogs and Context, I had two salient quotes. One by Bill Katz: "Be skeptical of the information the patron presents"[2]; the other by Pauline Cochrane: "Why should a user ever enter a search term that does not provide a link to the syndetic apparatus and a suggestion about how to proceed?"[3]. Both of these address the obvious, yet often overlooked, primary point of failure for library users: the disconnect between how users express their information needs and the terms assigned by the library to the items that may satisfy those needs.

Vickery's Three Issues for Stage 1 


Issue 1: Formulating the topic 

Vickery talks about three issues that must be addressed in his first stage, identifying the subject on which to search in a library catalog or indexing database. The first one is "...the inability even of specialist enquirers always to state their requirements exactly..." [1 p.1] That's the "reference interview" problem that Katz writes about: the user comes to the library with an ill-formed expression of what they need. We generally consider this to be outside the boundaries of the catalog, which means that it only exists for users who have an interaction with reference staff. Given that most users of the library today are not in the physical library, and that online services (from Google to Amazon to automated courseware) have trained users that successful finding does not require human interaction, these encounters with reference staff are a minority of the user-library sessions.

In online catalogs, we take what the user types into the search box as an appropriate entry point for a search, even though another branch of our profession is based on the premise that users do not enter the library with a perfectly formulated question, and need an intelligent intervention to have a successful interaction with the library. Formulating a precise question may not be easy, even for experienced researchers. For example, in a search about serving persons who have been infected with HIV, you may need to decide whether the research should also cover those whose infection has progressed to a medical diagnosis of AIDS. This decision is directly tied to the heading that will need to be searched:

HIV-positive persons--Counseling of
AIDS (Disease)--Patients--Counseling of

Issue 2: from topic to query 

The second of Vickery's caveats is that "[The researcher] may have chosen the correct concepts to express the subject, but may not have used the standard words of the index."[1 p.4] This is the "entry vocabulary" issue. What user would guess that the question "Where all did Dickens live?" would be answered with a search using "Dickens, Charles -- Homes and haunts"? And that a whole list of variant terms recorded as "use for" references would all translate to the term "HIV (Viruses)" in the catalog? (h/t Netanel Ganin)

As Pauline Cochrane points out[4], beginning in the latter part of the 20th century, libraries found themselves unable to include the necessary cross-reference information in their card catalogs, due to the cost of producing the cards. Instead, they asked users to look up terms in the subject heading reference books used by catalog librarians to create the headings. These books are not available to users of online catalogs, and although some current online catalogs include authorized alternate entry points in their searches, many do not.* This means that we have multiple generations of users who have not encountered "term switching" in their library catalog usage, and who probably do not understand its utility.
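A minimal sketch of the kind of term switching an online catalog could do with the cross-reference data that already exists in the subject authority files. The handful of mappings shown here is an invented sample, not the actual LCSH records:

    # A tiny, invented slice of a "use for" cross-reference table
    # (the real data lives in the subject authority records).
    use_for = {
        "carcinoma":  "Cancer",
        "aids virus": "HIV (Viruses)",
        "htlv-iii":   "HIV (Viruses)",
    }

    def switch_term(user_term):
        """Map what the user typed to an authorized heading, if a cross-reference exists."""
        heading = use_for.get(user_term.strip().lower())
        if heading:
            return f'Searching the authorized heading "{heading}" (you typed "{user_term}")'
        return f'No cross-reference found for "{user_term}"; falling back to a keyword search'

    print(switch_term("Carcinoma"))
    print(switch_term("Homes of Charles Dickens"))   # no match: the entry-vocabulary gap remains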

Even with such a terminology-switching mechanism, finding the proper entry in the catalog is not at all simple. The article by Thomas Mann (of the Library of Congress, not the German author) on "The Peloponnesian War and the Future of Reference, Cataloging, and Scholarship in Research Libraries" [5] shows not only how complex that process might be, but also that the translation can only be accomplished by a library-trained expert. This presents us with a great difficulty, because there are not enough such experts available to guide users, and not all users are willing to avail themselves of those services. How would a user discover that literature is French, but performing arts are in France?

French literature
Performing arts -- France -- History

Or, using the example in Mann's piece, the searcher looking for information on tribute payments in the Peloponnesian War needed to look under "Finance, public–Greece–Athens". This type of search failure fuels the argument that full text search is a better solution, and a search of Google Books on "tribute payments Peloponnesian war" does yield some results. The other side of the argument is that full text searches fail to retrieve documents not in the search language, while library subject headings apply to all materials in all languages. Somehow, this latter argument, in my experience, doesn't convince.

Issue 3: term order 

The third point by Vickery is one that keyword indexing has solved: "...the searcher may use the correct words to express the subject, but may not choose the correct combination order."[1 p.4] In 1959, when Vickery was writing this particular piece, having the wrong order of terms resulted in a failed search. Mann, however, would say that with keyword searching the user does not encounter the context that the pre-coordinated headings provide; thus keyword searching is not a solution at all. I'm with him part way, because I think keyword searching as an entry point to a vocabulary can be useful if it makes the syndetic structure visible from that starting point. Keyword searching directly against bibliographic records, less so.

Comparison to FRBR "find" 

FRBR's "find" is described as "to find entities that correspond to the user’s stated search criteria". [6 p. 79] We could presume that in FRBR the "user's stated search criteria" has either been modified through a prior process (although I hardly know what that would be, other than a reference interview), or that the library system has the capability to interact with the user in such a way that the user's search is optimized to meet the terminology of the library's knowledge organization system. This latter would require some kind of artificial intelligence and seems unlikely. The former simply does not happen often today, with most users being at a computer rather than a reference desk. FRBR's find seems to carry the same assumption as has been made functional in online catalogs, which is that the appropriateness of the search string is not questioned.

Summary 

There are two take-aways from this set of observations:

  1. We are failing to help users refine their query, which means that they may actually be basing their searches on concepts that will not fulfill their information need in the library catalog. 
  2. We are failing to help users translate their query into the language of the catalog(s). 

I would add that the language of the catalog should show users how the catalog is organized and how the knowledge universe is addressed by the library. This is implied in the second take-away, but I wanted to bring it out specifically, because it is a failure that particularly bothers me.

Notes

*I did a search in various catalogs on "cancer" and "carcinoma". Cancer is the form used in LCSH-cataloged bibliographic records, and carcinoma is a cross reference. I found a local public library whose Bibliocommons catalog did retrieve all of the records with "cancer" in them when the search was on "carcinoma", and that the same search in the Harvard Hollis system did not (carcinoma: 1,889 retrievals; cancer: 21,311). These are just two catalogs, and not a representative sample, to say the least, but they illustrate the point.

References

[1] Vickery, B C. Classification and Indexing in Science. New York: Academic Press, 1959.
[2] Katz, Bill. Introduction to Reference Work: Reference Services and Reference Processes. New York: McGraw-Hill, 1992. p. 82 http://www.worldcat.org/oclc/928951754. Cited in: Brown, Stephanie Willen. The Reference Interview: Theories and Practice. Library Philosophy and Practice 2008. ISSN 1522-0222
[3] Cochrane, Pauline A., Marcia J. Bates, Margaret Beckman, Hans H. Wellisch, Sanford Berman, Toni Petersen, and Stephen E. Wiberley, Jr. "Modern Subject Access in the Online Age: Lesson 3." American Libraries, Vol. 15, No. 4 (Apr. 1984), pp. 250-252, 254-255. Stable URL: http://www.jstor.org/stable/25626708
[4] Cochrane, Pauline A. "Modern Subject Access in the Online Age: Lesson 2." American Libraries, Vol. 15, No. 3 (Mar. 1984), pp. 145-148, 150. Stable URL: http://www.jstor.org/stable/25626647
[5] Thomas Mann, “The Peloponnesian War and the Future of Reference, Cataloging, and Scholarship in Research Libraries” (June 13, 2007). PDF, 41 pp. http://guild2910.org/Pelopponesian%20War%20June%2013%202007.pdf
[6] IFLA Study Group on the Functional Requirements for Bibliographic Records. Functional Requirements for Bibliographic Records, 2009. http://archive.ifla.org/VII/s13/frbr/frbr_2008.pdf.

Sunday, August 21, 2016

Wikipedia and the numbers fallacy

One of the main attempts at solutions to the lack of women on Wikipedia is to encourage more women to come to Wikipedia and edit. The idea is that greater numbers of women on Wikipedia will result in greater equality on the platform; that there will be more information about women and women's issues, and a hoped for "civilizing influence" on the brutish culture.

This argument is so obviously specious that it is hard for me to imagine that it is being put forth by educated and intelligent people. Women are not a minority - we are around 52% of the world's population and, with a few pockets of exception, we are culturally, politically, sexually, and financially oppressed throughout the planet. If numbers created more equality, where is that equality for women?

The "woman problem" is not numerical, and it cannot be solved with numbers. The problem is cultural; we know this because attacks against women can be traced to culture, not numbers: the brutal rapes in India, the harassment of German women by recently arrived immigrant men at the Hamburg railway station on New Year's Eve, the racist and sexist attacks on Leslie Jones on Twitter -- none of these can be explained by numbers. In fact, the stats show that over 60% of Twitter users are female, and yet Jones was horribly attacked. Gamergate arose at a time when the number of women in gaming was quite high, with data varying from 40% to over 50% of gamers being women. Women gamers are attacked not because there are too few of them, and there does not appear to be any safety in numbers.

The numbers argument is not only provably false, it is dangerous if mis-applied. Would women be safer walking home alone at night if we encouraged more women to do it?  Would having more women at frat parties reduce the rape culture on campus? Would women on Wikipedia be safer if there were more of them? (The statistics from 2011 showed that 13% of editors were female. The Wikimedia Foundation had a goal to increase the number to 25% by 2015, but Jimmy Wales actually stated in 2015 that the number of women was closer to 10% than 25%.) I think that gamergate and Twitter show us that the numbers are not the issue.

In fact, Wikipedia's efforts may have exacerbated the problem. The very public efforts to bring more women editors into Wikipedia (there have been and are organized campaigns both for women and about women) and the addition of more articles by and about women are going to be threatening to some members of the Wikipedia culture. In a recent example, an edit-a-thon produced twelve new articles about women artists. They were immediately marked for deletion, and yet, after analysis, ten of the articles were determined to be suitable, and only two were lost. It is quite likely that twelve new articles about male scientists (Wikipedia greatly values science over art, another bias) would not have produced this reaction; in fact, they might have sailed into the encyclopedia space without a hitch. Some editors are rebelling against the addition of information about women on Wikipedia, seeing it as a kind of reverse sexism (something that came up frequently in the attack on me).

Wikipedia's culture is a "self-run" society. So was the society in Lord of the Flies. If you are one of the people who believe that we don't need government, that individuals should just battle it out and see who wins, then Wikipedia might be for you. If, instead, you believe that we have a social obligation to provide a safe environment for people, then this self-run society is not going to be appealing. I've felt what it's like to be "Piggy," and I can tell you that it's not something I would want anyone else to go through.

I'm not saying that we do not want more women editing Wikipedia. I am saying that more women does not equate to more safety for women. The safety problem is a cultural problem, not a numbers problem. One of the big challenges is how we can define safety in an actionable way. Title IX, the US statute mandating equality of the sexes in education, revolutionized education and education-related sports. Importantly, it comes under the civil rights area of the Department of Justice. We need a Title IX for the Internet; one that requires those providing public services to make sure that there is no discrimination based on sex. Before we can have such a solution, we need to determine how to define "non-discrimination" in that context. It's not going to be easy, but it is a pre-requisite to solving the problem.