Sunday, August 21, 2016

Wikipedia and the numbers falacy

One of the main attempts at solutions to the lack of women on Wikipedia is to encourage more women to come to Wikipedia and edit. The idea is that greater numbers of women on Wikipedia will result in greater equality on the platform; that there will be more information about women and women's issues, and a hoped for "civilizing influence" on the brutish culture.

This argument is so obviously specious that it is hard for me to imagine that it is being put forth by educated and intelligent people. Women are not a minority - we are around 52% of the world's population and, with a few pockets of exception, we are culturally, politically, sexually, and financially oppressed throughout the planet. If numbers created more equality, where is that equality for women?

The "woman problem" is not numerical and it cannot solved with numbers. The problem is cultural; we know this because attacks against women can be traced to culture, not numbers: the brutal rapes in India, the harassment of German women by recent-arrived immigrant men at the Hamburg railway station on New Year's eve, the racist and sexist attacks on Leslie Jones on Twitter -- none of these can be explained by numbers. In fact, the stats show that over 60% of Twitter users are female, and yet Jones was horribly attacked. Gamergate arose at a time when the number of women in gaming is quite high, with data varying from 40% to over 50% of gamers being women. Women gamers are attacked not because there are too few of them, and there does not appear to be any safety in numbers.

The numbers argument is not only provably false, it is dangerous if mis-applied. Would women be safer walking home alone at night if we encouraged more women to do it?  Would having more women at frat parties reduce the rape culture on campus? Would women on Wikipedia be safer if there were more of them? (The statistics from 2011 showed that 13% of editors were female. The Wikimedia Foundation had a goal to increase the number to 25% by 2015, but Jimmy Wales actually stated in 2015 that the number of women was closer to 10% than 25%.) I think that gamergate and Twitter show us that the numbers are not the issue.

In fact, Wikipedia's efforts may have exacerbated the problem. The very public efforts to bring more women editors into Wikipedia (there have been and are organized campaigns both for women and about women) and the addition of more articles by and about women is going to be threatening to some members of the Wikipedia culture. In a recent example, an edit-a-thon produced twelve new articles about women artists. They were immediately marked for deletion, and yet, after analysis, ten of the articles were determined to be suitable, and only two were lost. It is quite likely that twelve new articles about male scientists (Wikipedia greatly values science over art, another bias) would not have produced this reaction; in fact, they might have sailed into the encyclopedia space without a hitch. Some editors are rebelling against the addition of information about women on Wikipedia, seeing it as a kind of reverse sexism (something that came up frequently in the attack on me).

Wikipedia's culture is a "self-run" society. So was the society in the Lord of the Flies. If you are one of the people who believe that we don't need government, that individuals should just battle it out and see who wins, then Wikipedia might be for you. If, instead, you believe that we have a social obligation to provide a safe environment for people, then this self-run society is not going to be appealing. I've felt what it's like to be "Piggy" and I can tell you that it's not something I would want anyone else to go through.

I'm not saying that we do not want more women editing Wikipedia. I am saying that more women does not equate to more safety for women. The safety problem is a cultural problem, not a numbers problem. One of the big challenges is how we can define safety in an actionable way. Title IX, the US statute mandating equality of the sexes in education,  revolutionized education and education-related sports. Importantly, it comes under the civil rights area of the Department of Justice. We need a Title IX for the Internet; one that requires those providing public services to make sure that there is no discrimination based on sex. Before we can have such a solution, we need to determine how to define "non-discrimination" in that context. It's not going to be easy, but it is a pre-requisite to solving the problem.

Wednesday, August 17, 2016

Classification, RDF, and promiscuous vowels

"[He] provided (i) a classified schedule of things and concepts, (ii) a series of conjunctions or 'particles' whereby the elementary terms can be combined to express composite subjects, (iii) various kinds of notational devices ... as a means of displaying relationships between terms." [1]

"By reducing the complexity of natural language to manageable sets of nouns and verbs that are well-defined and unambiguous, sentence-like statements can be interpreted...."[2]

The "he" in the first quote is John Wilkins, and the date is 1668.[3] His goal was to create a scientifically correct language that would have one and only one term for each thing, and then would have a set of particles that would connect those things to make meaning. His one and only one term is essentially an identifier. His particles are linking elements.

The second quote is from a publication about OCLC's linked data experiements, and is about linked data, or RDF. The goals are so obviously similar that it can't be overlooked. Of course there are huge differences, not the least of which is the technology of the time.*

What I find particularly interesting about Wilkins is that he did not distinguish between classification of knowledge and language. In fact, he was creating a language, a vocabulary, that would be used to talk about the world as classified knowledge. Here we are at a distance of about 350 years, and the language basis of both his work and the abstract grammar of the semantic web share a lot of their DNA. They are probably proof of some Chomskian theory of our brain and language, but I'm really not up to reading Chomsky at this point.

The other interesting note is how similar Wilkins is to Melvil Dewey. He wanted to reform language and spelling. Here's the section where he decries alphabetization because the consonants and vowels are "promiscuously huddled together without distinction." This was a fault of language that I have not yet found noted in Dewey's work. Could he have missed some imperfection?!





*Also, Wilkins was a Bishop in the Anglican church, and so his description of the history of language is based literally on the Bible, which makes for some odd conclusions.



[1]Schulte-Albert, Hans G. Classificatory Thinking from Kinner to Wilkins: Classification and Thesaurus Construction, 1645-1668. Quoting from Vickery, B. C. "The Significance of John Wilkins in the History of Bibliographical Classification." Libri 2 (1953): 326-43.
[2]Godby, Carol J, Shenghui Wang, and Jeffrey Mixter. Library Linked Data in the Cloud: Oclc's Experiments with New Models of Resource Description. , 2015.
[3] Wilkins, John. Essay Towards a Real Character, and a Philosophical Language. S.l: Printed for Sa. Gellibrand, and for John Martyn, 1668.

Tuesday, August 16, 2016

The case of the disappearing classification

I'm starting some research into classification in libraries (now that I have more time due to having had to drop social media from my life; see previous post). The main question I want to answer is: why did research into classification drop off at around the same time that library catalogs computerized? This timing may just be coincidence, but I'm suspecting that it isn't.

 I was in library school in 1971-72, and then again in 1978-80. In 1971 I took the required classes of cataloging (two semesters), reference, children's librarianship, library management, and an elective in law librarianship. Those are the ones I remember. There was not a computer in the place, nor do I remember anyone mentioning them in relation to libraries. I was interested in classification theory, but not much was happening around that topic in the US. In England, the Classification Research Group was very active, with folks like D.J. Foskett and Brian Vickery as mainstays of thinking about faceted classification. I wrote my first published article about a faceted classification being used by a UN agency.[1]

 In 1978 the same school had only a few traditional classes. I'd been out of the country, so the change to me was abrupt. Students learned to catalog on OCLC. (We had typed cards!) I was hired as a TA to teach people how to use DIALOG for article searching, even though I'd never seen it used, myself. (I'd already had a job as a computer programmer, so it was easy to learn the rules of DIALOG searching.) The school was now teaching "information science". Here's what that consisted of at the time: research into term frequency of texts; recall and precision; relevance ranking; database development.

I didn't appreciate it at the time, but the school had some of the bigger names in these areas, including William Cooper and M. E. "Bill" Maron. (I only just today discovered why he called himself Bill - the M. E., which is what he wrote under in academia, stands for "Melvin Earl". Even for a nerdy computer scientist, that was too much nerdity.) 1978 was still the early days of computing, at least unless you were on a military project grant or worked for the US Census Bureau. The University of California, Berkeley, did not have visible Internet access. Access to OCLC or DIALOG was via dial-up to their proprietary networks. (I hope someone has or will write that early history of the OCLC network. For its time it must have been amazing.)

The idea that one could search actual text was exciting, but how best to do it was (and still is, to a large extent) unclear. There was one paper, although I so far have not found it, that was about relevance ranking, and was filled with mathematical formulas for calculating relevance. I was determined to understand it, and so I spent countless hours on that paper with a cheat sheet beside me so I could remember what uppercase italic R was as opposed to lower case script r. I made it through the paper to the very end, where the last paragraph read (as I recall): "Of course, there is no way to obtain a value for R[elevance], so this theory cannot be tested." I could have strangled the author (one of my profs) with my bare hands.

Looking at the articles, now, though, I see that they were prescient; or at least that they were working on the beginnings of things we now take for granted. One statement by Maron especially strikes me today:
A second objective of this paper is to show that about is, in fact, not the central concept in a theory of document retrieval. A document retrieval system ought to provide a ranked output (in response to a search query) not according to the degree that they are about the topic sought by the inquiring patron, but rather according to the probability that they will satisfy that person‘s information need. This paper shows how aboutness is related to probability of satisfaction.[2] 
This is from 1977, and it essentially describes the basic theory behind Google ranking. It doesn't anticipate hyperlinking, of course, but it does anticipate that "about" is not the main measure of what will satisfy a searcher's need. Classification, in the traditional sense, is the quintessence of about. Is this the crux of the issue? As yet, I don't know. More to come.

[1]Coyle, Karen (1975). "A Faceted Classification for Occupational Safety and Health". Special Libraries. 66 (5-6): 256–9.
[2]Maron, M. E. (1977) "On Indexing, Retrieval, and the Meaning of About". Journal of the American Society for Information Science, January, 1977, pp. 38-43

Thursday, August 11, 2016

This is what sexism looks like: Wikipedia

We've all heard that there are gender problems on Wikipedia. Honestly there are a lot of problems on Wikipedia, but gender disparity is one of them. Like other areas of online life, on Wikipedia there are thinly disguised and not-so thinly disguised attacks on women. I am at the moment the victim of one of those attacks.

Wikipedia runs on a set of policies that are used to help make decisions about content and to govern behavior. In a sense, this is already a very male approach, as we know from studies of boys and girls at play: boys like a sturdy set of rules, and will spend considerable time arguing whether or not rules are being followed; girls begin play without establishing a set of rules, develop agreed rules as play goes on if needed, but spend little time on discussion of rules.

If you've been on Wikipedia and have read discussions around various articles, you know that there are members of the community that like to "wiki-lawyer" - who will spend hours arguing whether something is or is not within the rules. Clearly, coming to a conclusion is not what matters; this is blunt force, nearly content-less arguing. It eats up hours of time, and yet that is how some folks choose to spend their time. There are huge screaming fights that have virtually no real meaning; it's a kind of fantasy sport.

Wiki-lawyering is frequently used to harass. It is currently going on to an amazing extent in harassment of me, although since I'm not participating, it's even emptier. The trigger was that I sent back for editing two articles about men that two wikipedians thought should not have been sent back. Given that I have reviewed nearly 4000 articles, sending back 75% of those for more work, these two are obviously not significant. What is significant, of course, is that a woman has looked at an article about a man and said: "this doesn't cut it". And that is the crux of the matter, although the only person to see that is me. It is all being discussed as violations of policy, although there are none. But sexism, as with racism, homophobia, transphobia, etc., is almost never direct (and even when it is, it is often denied). Regulating what bathrooms a person can use, or denying same sex couples marriage, is a kind of lawyering around what the real problem is. The haters don't say "I hate transexuals" they just try to make them as miserable as possible by denying them basic comforts. In the past, and even the present, no one said "I don't want to hire women because I consider them inferior" they said "I can't hire women because they just get pregnant and leave."

Because wiki-lawyering is allowed, this kind of harassment is allowed. It's now gone on for two days and the level of discourse has gotten increasingly hysterical. Other than one statement in which I said I would not engage because the issue is not policy but sexism (which no one can engage with), it has all been between the wiki-lawyers, who are working up to a lynch mob. This is gamer-gate, in action, on Wikipedia.

It's too bad. I had hopes for Wikipedia. I may have to leave. But that means one less woman editing, and we were starting to gain some ground.

The best read on this topic, mainly about how hard it is to get information that is threatening to men (aka about women) into Wikipedia: WP:THREATENING2MEN: Misogynist Infopolitics and the Hegemony of the Asshole Consensus on English Wikipedia

I have left Wikipedia, and I also had to delete my Twitter account because they started up there. I may not be very responsive on other media for a while. Thanks to everyone who has shown support, but if by any chance you come across a kinder, gentler planet available for habitation, do let me know. This one's desirability quotient is dropping fast.

Sunday, July 10, 2016

Catalogs and Context: Part V

This entire series is available a single file on my web site.

Before we can posit any solutions to the problems that I have noted in these posts, we need to at least know what questions we are trying to answer. To me, the main question is:

What should happen between the search box and the bibliographic display?

Or as Pauline Cochrane asked: "Why should a user ever enter a search term that does not provide a link to the syndetic apparatus and a suggestion about how to proceed?"[1] I really like the "suggestion about how to proceed" that she included there. Although I can think of some exceptions, I do consider this an important question.

If you took a course in reference work at library school (and perhaps such a thing is no longer taught - I don't know), then you learned a technique called "the reference interview." The Wikipedia article on this is not bad, and defines the concept as an interaction at the reference desk "in which  the librarian responds to the user's initial explanation of his or her information need by first attempting to clarify that need and then by directing the user to appropriate information resources." The assumption of the reference interview is that the user arrives at the library with either an ill-formed query, or one that is not easily translated to the library's sources. Bill Katz's textbook "Introduction to Reference Work" makes the point bluntly:

"Be skeptical of the of information the patron presents" [2]

If we're so skeptical that the user could approach the library with the correct search in mind/hand, then why then do we think that giving the user a search box in which to put that poorly thought out or badly formulated search is a solution? This is another mind-boggler to me.

So back to our question, what SHOULD happen between the search box and the bibliographic display? This is not an easy question, and it will not have a simple answer. Part of the difficulty of the answer is that there will not be one single right answer. Another difficulty is that we won't know a right answer until we try it, give it some time, open it up for tweaking, and carefully observe. That's the kind of thing that Google does when they make changes in their interface, but we haven't got either Google's money nor its network (we depend on vendor systems, which define what we can and cannot do with our catalog).

Since I don't have answers (I don't even have all of the questions) I'll pose some questions, but I really want input from any of you who have ideas on this, since your ideas are likely to be better informed than mine. What do we want to know about this problem and its possible solutions?

(Some of) Karen's Questions

Why have we stopped evolving subject access?

Is it that keyword access is simply easier for users to understand? Did the technology deceive us into thinking that a "syndetic apparatus" is unnecessary? Why have the cataloging rules and bibliographic description been given so much more of our profession's time and development resources than subject access has? [3]

Is it too late to introduce knowledge organization to today's users?

The user of today is very different to the user of pre-computer times. Some of our users have never used a catalog with an obvious knowledge organization structure that they must/can navigate. Would they find such a structure intrusive? Or would they suddenly discover what they had been missing all along? [4]

Can we successfully use the subject access that we already have in library records?

Some of the comments in the articles organized by Cochrane in my previous post were about problems in the Library of Congress Subject Headings (LCSH), in particular that the relationships between headings were incomplete and perhaps poorly designed.[5] Since LCSH is what we have as headings, could we make them better? Another criticism was the sparsity of "see" references, once dictated by the difficulty of updating LCSH. Can this be ameliorated? Crowdsourced? Localized?

We still do not have machine-readable versions of the Library of Congress Classification (LCC), and the machine-readable Dewey Decimal Classification (DDC) has been taken off-line (and may be subject to licensing). Could we make use of LCC/DDC for knowledge navigation if they were available as machine-readable files?

Given that both LCSH and LCC/DDC have elements of post-composition and are primarily instructions for subject catalogers, could they be modified for end-user searching, or do we need to develop a different instrument altogether?

How can we measure success?

Without Google's user laboratory apparatus, the answer to this may be: we can't. At least, we cannot expect to have a definitive measure. How terrible would it be to continue to do as we do today and provide what we can, and presume that it is better than nothing? Would we really see, for example, a rise in use of library catalogs that would confirm that we have done "the right thing?"


Notes

[1]*Modern Subject Access in the Online Age: Lesson 3
Author(s): Pauline A. Cochrane, Marcia J. Bates, Margaret Beckman, Hans H. Wellisch, Sanford Berman, Toni Petersen, Stephen E. Wiberley and Jr.
Source: American Libraries, Vol. 15, No. 4 (Apr., 1984), pp. 250-252, 254-255
Stable URL: http://www.jstor.org/stable/25626708

[2] Katz, Bill. Introduction to Reference Work: Reference Services and Reference Processes. New York: McGraw-Hill, 1992. p. 82 http://www.worldcat.org/oclc/928951754. Cited in: Brown, Stephanie Willen. The Reference Interview: Theories and Practice. Library Philosophy and Practice 2008. ISSN 1522-0222

[3] One answer, although it doesn't explain everything, is economic: the cataloging rules are published by the professional association and are a revenue stream for it. That provides an incentive to create new editions of rules. There is no economic gain in making updates to the LCSH. As for the classifications, the big problem there is that they are permanently glued onto the physical volumes making retroactive changes prohibitive. Even changes to descriptive cataloging must be moderated so as to minimize disruption to existing catalogs, which we saw happen during the development of RDA, but with some adjustments the new and the old have been made to coexist in our catalogs.

[4] Note that there are a few places online, in particular Wikipedia, where there is a mild semblance of organized knowledge and with which users are generally familiar. It's not the same as the structure that we have in subject headings and classification, but users are prompted to select pre-formed headings, with a keyword search being secondary.

[5] Simon Spero did a now famous (infamous?) analysis of LCSH's structure that started with Biology and ended with Doorbells.

Monday, July 04, 2016

Catalogs and Content: an Interlude

This entire series is available a single file on my web site.

"Editor's note. Providing subject access to information is one of the most important professional services of librarians; yet, it has been overshadowed in recent years by AACR2, MARC, and other developments in the bibliographic organization of information resources. Subject access deserves more attention, especially now that results are pouring in from studies of online catalog use in libraries."
American Libraries, Vol. 15, No. 2 (Feb., 1984), pp. 80-83
Having thought and written about the transition from card catalogs to online catalogs, I began to do some digging in the library literature, and struck gold. In 1984, Pauline Atherton Cochrane, one of the great thinkers in library land, organized a six-part "continuing education" to bring librarians up to date on the thinking regarding the transition to new technology. (Dear ALA - please put these together into a downloaded PDF for open access. It could make a difference.) What is revealed here is both stunning and disheartening, as the quote above shows; in terms of catalog models, very little progress has been made, and we are still spending more time organizing atomistic bibliographic data while ignoring subject access.

The articles are primarily made up of statements by key library thinkers of the time, many of whom you will recognize. Some responses contradict each other, others fall into familiar grooves. Library of Congress is criticized for not moving faster into the future, much as it is today, and yet respondents admit that the general dependency on LC makes any kind of fast turn-around of changes difficult. Some of the desiderata have been achieved, but not the overhaul of subject access in the library catalog.

The Background

If you think that libraries moved from card catalogs to online catalogs in order to serve users better, think again. Like other organizations that had a data management function, libraries in the late 20th century were reaching the limits of what could be done with analog technology. In fact, as Cochrane points out, by the mid-point of that century libraries had given up on the basic catalog function of providing cross references from unused to used terminology, as well as from broader and narrower terms in the subject thesaurus. It simply wasn't possible to keep up with these, not to mention that although the Library of Congress and service organizations like OCLC provided ready-printed cards for bibliographic entries, they did not provide the related reference cards. What libraries did (and I remember this from my undergraduate years) is they placed near the card catalog copies of the "Red Book". This was the printed Library of Congress Subject Heading list, which by my time was in two huge volumes, and, yes, was bound in red. Note that this was the volume that was intended for cataloging librarians who were formulating subject headings for their collections. It was never intended for the end-users of the catalog. The notation ("x", "xx", "sa") was far from intuitive. In addition, for those users who managed to follow the references, it pointed them to the appropriate place in LCSH, but not necessarily in the catalog of the library in which they were searching. Thus a user could be sent to an entry that simply did not exist.

The "RedBook" today
From my own experience, when we brought up the online catalog at the University of California, the larger libraries had for years had difficulty keeping the card catalog up to date. The main library at the University of California at Berkeley regularly ran from 100,000 to 150,000 cards behind in filing into the catalog, which filled two enormous halls. That meant that a book would be represented in the catalog about three months after it had been cataloged and shelved. For a research library, this was a disaster. And Berkeley was not unusual in this respect.

Computerization of the catalog was both a necessary practical solution, as well as a kind of holy grail. At the time that these articles were written, only a few large libraries had an online catalog, and that catalog represented only a recent portion of the library's holdings. (Retrospective conversion of the older physical card catalog to machine-readable form came later, culminating in the 1990's.) Abstracting and indexing databases had preceded libraries in automating, DIALOG, PRECIS, and others, and these gave librarians their first experience in searching computerized bibliographic data.

This was the state of things when Cochrane presented her 6-part "continuing education" series in American Libraries.

Subject Access

The series of articles was stimulated by an astonishingly prescient article by Marcia Bates in 1977. In that article she articulates both concerns and possibilities that, quite frankly, we should all take to heart today. In Lesson 3 of Cochrane's articles, Bates is quotes from 1977 saying:
"...with automation, we have the opportunity to introduce many access points to a given book. We can now use a subject approach... that allows the naive user, unconscious of and uninterested in the complexities of synonymy and vocabulary control, to blunder on to desired subjects, to be guided, without realizing it, by a redundant but carefully controlled subject access system." 
and
"And now is the time to change -- indeed, with MARC already so highly developed, past time. If we simply transfer the austerity-based LC subject heading approach to expensive computer systems, then we have used our computers merely to embalm the constraints that were imposed on library systems back before typewriters came into use!"

This emphasis on subject access was one of the stimuli for the AL lessons. In the early 1980's, studies done at OCLC and elsewhere showed that over 50% of the searches being done in the online catalogs of that day were subject searches, even those going against title indexes or mixed indexes. (See footnotes to Lesson 3.) Known item searching was assumed to be under control, but subject searching posed significant problems. Comments in the article include:
"...we have not yet built into our online systems much of the structure for subject access that is already present in subject cataloging. That structure is internal and known by the person analyzing the work; it needs to be external and known by the person seeking the work."
"Why should a user ever enter a search term that does not provide a link to the syndetic apparatus and a suggestion about how to proceed?"
Interestingly, I don't see that any of these problems has been solved into today's systems.

As a quick review, here are some of the problems, some proposed solutions, and some hope for future technologies that are presented by the thinkers that contributed to the lessons.

Problems noted

Many problems were surfaced, some with fairly simple solutions, others that we still struggle with.
  • LCSH is awkward, if not nearly unusable, both for its vocabulary and for the lack of a true hierarchical organization
  • Online catalogs' use of LCSH lacks syndetic structure (see, see also, BT, NT). This is true not only for display, but in retrieval, search on a broader term does not retrieve items with a narrower term (which would be logical to at least some users)
  • Libraries assign too few subject headings
  • For the first time, some users are not in the library while searching so there are no intermediaries (e.g. reference librarians) available. (One of the flow diagrams has a failed search pointing to a box called "see librarian" something we would not think to include today.)
  • Lack of a professional theory of information seeking behavior that would inform systems design. ("Without a blueprint of how most people want to search, we will continue to force them to search the we want to search." Lesson 5)
  • Information overload, aka overly large results, as well as too few results on specific searches

Proposed solutions

Some proposed solutions were mundane (add more subject headings to records) while others would require great disruption to the library environment.
  • Add more subject headings to MARC records
  • Use keyword searching, including keywords anywhere in the record.
  • Add uncontrolled keywords to the records.
  • Make the subject authority file machine-readable and integrate it into online catalogs.
  • Forget LCSH, instead use non-library bibliographic files for subject searching, such as A&I databases.
  • Add subject terms from non-library sources to the library catalog, and/or do (what today we call) federated searching
  • LCSH must provide headings that are more specific as file sizes and retrieved sets grow (in the document, a retrieved set of 904 items was noted with an exclamation point)

Future thinking 

As is so often the case when looking to the future, some potential technologies were seen as solutions. Some of these are still seen as solutions today (c.f. artificial intelligence), while others have been achieved (storage of full text).
  • Full text searching, natural language searches, and artificial intelligence will make subject headings and classification unnecessary
  • We will have access to back-of-the-book indexes and tables of contents for searching, as well as citation indexing
  • Multi-level systems will provide different interfaces for experts and novices
  • Systems will be available 24x7, and there will be a terminal in every dorm room
  • Systems will no longer need to use stopwords
  • Storage of entire documents will become possible

End of Interlude

Although systems have allowed us to store and search full text, to combine bibliographic data from different sources, and to deliver world-wide, 24x7, we have made almost no progress in the area of subject access. There is much more to be learned from these articles, and it would be instructive to do an in-depth comparison of them to where we are today. I greatly recommend reading them, each is only a few pages long.

----- The Lessons -----

*Modern Subject Access in the Online Age: Lesson 1
by Pauline Atherton Cochrane
Source: American Libraries, Vol. 15, No. 2 (Feb., 1984), pp. 80-83
Stable URL: http://www.jstor.org/stable/25626614

*Modern Subject Access in the Online Age: Lesson 2 Pauline A. Cochrane American Libraries Vol. 15, No. 3 (Mar., 1984), pp. 145-148, 150 Stable URL: http://www.jstor.org/stable/25626647

*Modern Subject Access in the Online Age: Lesson 3
Author(s): Pauline A. Cochrane, Marcia J. Bates, Margaret Beckman, Hans H. Wellisch, Sanford Berman, Toni Petersen, Stephen E. Wiberley and Jr.
Source: American Libraries, Vol. 15, No. 4 (Apr., 1984), pp. 250-252, 254-255
Stable URL: http://www.jstor.org/stable/25626708

*Modern Subject Access in the Online Age: Lesson 4
Author(s): Pauline A. Cochrane, Carol Mandel, William Mischo, Shirley Harper, Michael Buckland, Mary K. D. Pietris, Lucia J. Rather and Fred E. Croxton
Source: American Libraries, Vol. 15, No. 5 (May, 1984), pp. 336-339
Stable URL: http://www.jstor.org/stable/25626747

*Modern Subject Access in the Online Age: Lesson 5
Author(s): Pauline A. Cochrane, Charles Bourne, Tamas Doczkocs, Jeffrey C. Griffith, F. Wilfrid Lancaster, William R. Nugent and Barbara M. Preschel
Source: American Libraries, Vol. 15, No. 6 (Jun., 1984), pp. 438-441, 443
Stable URL: http://www.jstor.org/stable/25629231

*Modern Subject Access In the Online Age: Lesson 6
Author(s): Pauline A. Cochrane, Brian Aveney and Charles Hildreth Source: American Libraries, Vol. 15, No. 7 (Jul. - Aug., 1984), pp. 527-529
Stable URL: http://www.jstor.org/stable/25629275

Wednesday, June 29, 2016

Catalog and Context, Part IV

Part I, Part II, Part III

(I fully admit that this topic deserves a much more extensive treatment than I will give it here. My goal is to stimulate discussion that would lead to efforts to develop models of that catalog that support a better user experience.)

Recognizing that users need a way to make sense out of large result sets, some library catalogs have added features that attempt to provide context for the user. The main such effort that I am aware of is the presentation of facets derived from some data in the bibliographic records. Another model, although I haven't seen it integrated well into library catalogs, is data mining; doing an overall analysis combining different data in the records, and making this available for search. Lastly, we have the development of entities that are catalog elements in their own right; this generally means the treatment of authors, subjects, etc., as stand-alone topics to be retrieved and viewed, even apart from their role in the retrieval of a bibliographic item. Treating these as "first-class entities" is not the same as the heading layer over bibliographic records, but it may be exploitable to provide a kind of context for users.

Facets

Faceted classification was all the rage when I attended library school in the early 1970's, having been bolstered by the work of the UK-based Classification Research Group, although the prime mover of this type of classification was S R Ranganathan who thoroughly explicated the concept in the 1930s. Faceted classification was to 1970's knowledge organization what KWIC and KWOC were to text searching: facets potentially provided a way to create complex subject headings whose individual parts could be the subject of access on their own or in context.

In library systems "faceting" has exploited information from the bibliographic record can be discretely applied to a retrieved set. Facets are all "accidents" of the existing data, as catalog record creation is not based on faceted cataloging.

In general, facets are fixed data elements, or whole or part heading strings. Authors are used as facets, generally showing the top-occurring author names with counts.
Authors as facets
Date of publication is also a commonly used facet, not so much because it is inherently useful but mainly because "it exists."

Dates as facets

Subject Facets

Faceting is, to a degree, already incorporated into our subject access systems. Library of Congress subject headings are faceted to some extent, with topic facets, geographic facets, and time facets. The Library of Congress Classification and the Dewey Decimal Classification make some use of facets where they allow entries in the classification to be extended by place, time, or other re-usable subdivisions.

Some systems have taken a page from the FAST book. FAST is Faceted Application of Subject Terminology, and it creates facets by breaking apart the segments of a Library of Congress subject heading such that topics, geographical entries, and time periods become separate entries. FAST itself does more than this, including turning some inverted headings (Lake, Erie) back to their natural order, and other changes. One of the main criticisms of FAST, however, is that it loses the very context that is provided by the composite subject heading. Thus the headings on Moby Dick become Whales / Whaling / Mentally Ill / Fiction, and leaves it unclear who or what is mentally ill in this example. (I'm sure there are better examples - send them in!)

Summon system use of facets

The Open Library created subject facets from Library of Congress subject headings, and categorizes each by its facet "type":
Open Library subject facts


Although these are laudable attempts to give the user a way to both understand and further refine the retrieved set, there are a number of problems with these implementations, not the least of which is that many of these are not actually facets in the knowledge organization sense of that term. Facets need to be conceptual divisions of the landscape that help a user understand that landscape.
Online sales sites use something that they call faceted classification, although it varies considerably from the concept of faceted classification that originated with S. R. Ranganathan in the 1930's. On a sales site, facets divide the site's products into categories so that users can add those categories to their searches. A search for shoes in general is less useful than a search for shoes done under the categories "men's", "women's" or "children's". In the online sales sense, a facet is a context for the keyword search. For all that the overall universe that these facets govern is much simpler than the entire knowledge universe that libraries must try to handle, at least the concept of context is employed to help the user.

Amazon's facets
While it may be helpful to see who are the most numerous authors in a retrieved set, authorship does not provide a conceptual organization for the user. Next, not everything that can be exploited in a bibliographic record to narrow a result set is necessarily useful. The list of publication dates from the retrieved set is not only too granular to be a useful facet (think of how many different dates there could be) but the likelihood that a user's query can be fulfilled by a publication year datum is scant indeed.

The last problem is really the key here, which is that while isolated bits of data like date or place may help narrow a large result set they do not provide the kind of overall context for searches that a truly faceted system might. However, providing such a view requires that the entries in the library catalog have been classified using a faceted classification system, and that is simply not the case.

Data Mining


I include this because I think it is interesting, although the only real instances of it that I am aware of come from OCLC, which is uniquely positioned to do the kind of "big data" work that this implies. The WorldCat Identities project shows the kind of data that one can extract from a large bibliographic database. Data mining applies best to the bibliographic universe as a whole, rather than individual catalogs, since those latter are by definition incomplete. It would, however, be interesting to see what uses could be made of mined data like WorldCat Identities, for example giving users of individual catalogs information about sources that the library does not hold. It is also a shame that WorldCat Identities appears to have been a one-off and is not being kept up to date.
Emily Dickinson at WorldCat Identities

First Class Objects

A potential that linked data brings (but does not guarantee) is the development of some of the key bibliographic entities into "first class objects". By that I mean that some entities could be the focus of searches on their own, not just as index entries to bibliographic records. Having some entities be first class objects means that, for example, you can have  a page for a person that is truly about the person, not just a heading with the personal name in it. This allows you to present the user with additional information, either similar to WorldCat Identities, if you have that information available to you, or taking text from sources like Wikipedia, like Open Library did:
Open Library author page

This was also the model used in the linked data database Freebase (which has now been killed by Google), and is not entirely unlike Google's use of Wikipedia (and other sources) to create its "knowledge graph."
Google Knowledge Graph

The treatment of some things as first class objects is perhaps a step toward the catalog of headings, but the person as an object is not itself a replication of the heading system that is found in bibliographic records, which go beyond the person's name in their organizational function:
Dickens, Charles, 1812-1870--Adaptations. Dickens, Charles, 1812-1870--Adaptations--Comic books, strips, etc. Dickens, Charles, 1812-1870--Adaptations--Congresses. Dickens, Charles, 1812-1870--Aesthetics. Dickens, Charles, 1812-1870--Anecdotes. Dickens, Charles, 1812-1870--Anniversaries, etc. Dickens, Charles, 1812-1870--Appreciation. Dickens, Charles, 1812-1870--Appreciation--Croatia.
For subject headings, a key aspect of the knowledge map is the inclusion of relationships from broader and narrower terms and related terms. I will not pretend that the existing headings are perfect, as we know they are not, but it is hard to imagine a knowledge organization system that will not make use of these taxonomic concepts in one way or another.
Lake Erie See: Erie, Lake Lake Erie, Battle of, 1813. BT:United States--History--War of 1812--Campaigns Lake Erie, Battle of, 1813--Bibliography. Lake Erie, Battle of, 1813--Commemoration. Lake Erie, Battle of, 1813--Fiction. Lake Erie, Battle of, 1813--Juvenile fiction. Lake Erie, Battle of, 1813--Juvenile literature. Lake Erie Transportation Company≈ See Also: Erie Railroad Company.
This information is now available through the Library of Congress linked data service, id.loc.gov and surely, with some effort, these aspects of the "first class entity" (person, place, topic, etc.) could be recovered and made available to the user. Unfortunately (how often have I said that in these posts?), the subject heading authorities were designed as a model for subject heading creation, not as a full list of all possible subject headings, and connecting the authority file, which contains the relationships between terms, mechanically to the headings in bibliographic records is not a snap. Again, what was modeled for the card catalog and worked well in that technology does not translate perfectly to the newer technologies.

Note that the emphasis on bibliographic entities in FRBR, RDA and BIBFRAME could facilitate such a solution. All three encourage an entity view of data that has traditionally included in bibliographic records and that is not entirely opposed to the concept of the separation of bibliographic data and authorities. In addition, FRBR provides a basis for conceptualizing works and editions (FRBR's expression) as separate entities. These latter exist already in many forms in the "real world" as objects of critical thinking, description, and point of sale. The other emphasis in FRBR is on bibliographic relationships. This has helped us understand that relationships are important, although these bibliographic relationships are the tip of the iceberg if we look at user service as a whole.

Next 

Next I want to talk about possibilities. But because I do not have the answers, I am going to present them in the form of questions - because we need first to have questions before we can posit any answers.