Subject analysis is the most challenging part of cataloguing and metadata creation. Let us see how difficult it is.
Subject analysis normally includes two processes: 1) determining the subject matter of the information package, and 2) determining the appropriate terms to describe that subject matter from controlled vocabulary lists.
What are the problems in process 1?
We have to find out from the information packages:
1. What is it?
2. What is it for?
3. What is it about?
To answer question 1, we have to find out the fundamental form (the category) of knowledge (e.g. sociology) of the information package. To answer question 2, we have to find out the motivation behind the creation of the information package: for whom, and for what purpose? To answer question 3, we have to find out the topics being discussed. Topics, moreover, carry social and time factors.
[Source: Subject Analysis. Chapter 9 in The Organization of Information by Arlene G. Taylor. 2nd ed.]
“Some critical information is represented redundantly. Example: A description of an e-book is spread across the 245, 008, and 300 fields, but is concisely represented in a more modern standard such as ONIX.”
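To make that redundancy concrete, here is a minimal sketch (field values invented for illustration, not taken from any real record): the single fact “this is an e-book” surfaces three times in a MARC record, while an ONIX-like structure states it once.

```python
# Toy MARC record (values invented). In MARC, position 23 of the 008 field,
# subfield $h of the 245, and the 300 field can all hint at "e-book".
marc_record = {
    "008": "210101s2021    xxu     o     000 0 eng d",  # "o" at pos. 23 = online
    "245": {"a": "An Example E-book", "h": "[electronic resource]"},
    "300": {"a": "1 online resource (356 p.)"},
}

# An ONIX-like structure (simplified here to a plain dict) says it once.
onix_like = {"title": "An Example E-book", "form": "Digital download"}

def is_ebook_marc(rec):
    # Each MARC field states the same fact in its own idiom.
    return (
        rec["008"][23] == "o"
        or "electronic resource" in rec["245"].get("h", "")
        or "online resource" in rec["300"]["a"]
    )

print(is_ebook_marc(marc_record))  # → True (all three clues agree)
```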
Open Library is a project of the Internet Archive. “One web page for every book ever published” is the goal of the project.
It is an open project: like a wiki, everyone is able to create a book record, and everyone can edit any record. It uses open-source technology. Its records are open in the sense that no record is ever ‘completed’.
On its home page, it says “Open Library is your library to read, to borrow, and to search”. Open Library has over 20 million edition records and provides 1.7 million scanned versions of books. It links to external sources such as WorldCat and Amazon.
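Open Library's records are also reachable programmatically. A small sketch, assuming the public Books API endpoint; the ISBN is an arbitrary example and no request is actually sent here:

```python
from urllib.parse import urlencode

def open_library_url(isbn):
    # Build a query against the Open Library Books API.
    params = urlencode({
        "bibkeys": f"ISBN:{isbn}",  # which record to look up
        "format": "json",
        "jscmd": "data",            # ask for the full data view of the record
    })
    return f"https://openlibrary.org/api/books?{params}"

print(open_library_url("0451526538"))
```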
“Library catalog data is a language and a communication system all its own. Who else creates things that look like:
xii, 356 p. ; 23 cm.
Few outside of the library world have any idea what that means. What about:
Hamlet. French. 1923
It’s like the secret language of twins, but shared over generations and with tens of thousands of initiates. We can have our secret inner circle, sure, but the library catalog is supposed to be our face to the world. It hasn’t been that. And it’s not going to be that if we continue to see the catalog of resources as our primary interaction with users. Taking the data we have today in the catalog and converting it to linked data would be a repetition of what we did when we went from the catalog card to the MARC format: translating the same functionality into a new data format. It’s not going to make our data any more user friendly or any more useful.
In addition, the catalog primarily organizes things: resources, books, CDs, and DVDs. Where we need to go is to KNOWLEDGE ORGANIZATION. Not STUFF organization.”
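To see how much is packed into a string like “xii, 356 p. ; 23 cm.”, here is a rough decoder sketch (my own illustration, not a standard library routine): the pattern pulls out the roman-numeral preliminary pages, the page count, and the height in centimetres.

```python
import re

def parse_physical_description(desc):
    # Decode a MARC-style physical description such as "xii, 356 p. ; 23 cm."
    m = re.match(r"(?:([ivxlcdm]+),\s*)?(\d+)\s*p\.\s*;\s*(\d+)\s*cm\.?", desc)
    if not m:
        return None
    return {
        "preliminary_pages": m.group(1),  # roman-numeral front matter ("xii")
        "pages": int(m.group(2)),         # numbered pages (356)
        "height_cm": int(m.group(3)),     # spine height in centimetres (23)
    }

print(parse_physical_description("xii, 356 p. ; 23 cm."))
```

Few readers outside libraries would guess any of this, which is exactly the quote's point.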
The library catalogue uses controlled vocabularies for names, subjects, places, forms, etc. The standardised terms help users find like items together. This is something that common web search engines cannot provide. (See previous post.)
Here is a video created by a vendor explaining the functions of controlled vocabularies.
“Why does Google (Yahoo, Bing) use keyword searching? Because it’s easy. It is mechanical. It is a match between a string in a query and a string in a database (even with all of its enhancements, that’s the bottom line). It requires no knowledge of the topic, no human intervention, no experts. Keyword searching is NOT knowledge organization. With keyword searching there are no relationships between things. You can’t go broader or narrower; you can’t get “things like this”; it doesn’t even have facets. I said before that users are accustomed to the single search box, and many see it as representing freedom – wide open, anything goes. It’s not a freedom, it’s a constraint. It basically constrains the user to try to guess what words will bring up the information she is seeking – which is a bit unfair, since the assumption is that there is something the user doesn’t know, which is why she is doing a search. The user has to translate what might be a complex information need into a couple of words. And as Elaine Svenonius notes:
“At the same time, it is known that users in their attempts to search by subject sometimes find themselves at a loss for words.” (Svenonius, The Intellectual Foundation of Information Organization, p. 135)
What works for keyword searching?
nouns, especially proper nouns
– programming languages (Python, Ruby) (Note that you don’t retrieve much about snakes or gems with these searches, showing a particular bias in the content of the Web itself)
– titles of books or essays (Moby Dick)
What doesn’t work?
searching for concepts
searching for things with common terms in their names (library, catalog) (Often when I’m searching for topics relating to libraries I find myself in github.)
you can’t ask a specific question: When did Melville write Moby Dick? You can only put in those terms and hope that a retrieved web page contains the answer. (Wolfram Alpha is trying to address this problem)
Google has all of the knowledge base of a phone book. You name it, you retrieve it.
Did you ever wonder why so many searches turn up Wikipedia in the first few hits? Wikipedia is ORGANIZED INFORMATION. To me it is the proof that organized information is needed, works, and helps people find and learn. Wikipedia does have pages for concepts, it does have links between related subjects, it IS organized knowledge. How well does keyword searching work? Some analogies:
it’s like dumpster diving for information; you dig through a lot of garbage but you might find a clean, wrapped sandwich
it’s like dynamite fishing; you throw dynamite into a lake and see what gets thrown up into the air.
it’s like your grandmother’s button box; you need a button and you can spend ages digging through trying to find one that matches on size and color. Or you can go to the store where they have the buttons in order by size and color, and pay a couple of bucks.
We tend to ignore the false hits and zoom in on the successes. But the main thing is that this imprecise retrieval puts a huge burden on the user, who has to essentially game the system to get retrievals and then has to dig through what comes back to sort wheat from chaff. In his book Everything is Miscellaneous, David Weinberger talks about tagging, and says that a search on Flickr for “San Francisco” will bring up photos of a number of different places named San Francisco, but what does that matter? I think it matters, and it matters especially for the least experienced users who find such things confusing. Everything might be miscellaneous but it is also time consuming and annoying.”
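The quote’s point that keyword searching has “no relationships between things” can be sketched with a toy thesaurus (terms and documents invented for illustration): plain string matching cannot broaden or narrow, while a controlled vocabulary can expand a query to its narrower terms.

```python
# A two-entry toy thesaurus with broader/narrower (BT/NT) relations.
thesaurus = {
    "libraries": {"narrower": ["academic libraries", "public libraries"]},
    "academic libraries": {"broader": ["libraries"]},
    "public libraries": {"broader": ["libraries"]},
}

documents = ["funding of public libraries", "python library for json"]

def keyword_search(query, docs):
    # Pure string matching: no notion of broader or narrower terms, and
    # "library" happily matches software libraries too.
    return [d for d in docs if query in d]

def vocabulary_search(term, docs):
    # Expand the query with narrower terms from the controlled vocabulary.
    terms = [term] + thesaurus.get(term, {}).get("narrower", [])
    return [d for d in docs if any(t in d for t in terms)]

print(keyword_search("library", documents))       # ['python library for json']
print(vocabulary_search("libraries", documents))  # ['funding of public libraries']
```

The keyword search lands on the software sense of “library” (the GitHub problem mentioned above), while the vocabulary-backed search retrieves the document about library funding.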
“The advantages of linked data are real. The primary reason to move to linked data is that it is a metadata format designed for data that lives on the Web.
… [It] means being able to interact with web resources; it means being visible to web users; it also means being able to take advantage of the web as a platform. This latter cannot be overstated: the web is a huge system based on solid technology that is much more reliable than any small organization can create in house. (Google, not a small organization, may have a private network that rivals the web, but the rest of us cannot get anywhere near that.) I’m talking not about ‘web scale,’ but ‘on the web.’
There are other excellent reasons to move to linked data. Linked data provides a flexibility that previous technologies do not. Linked data allows expansion in a non-disruptive way; just as anyone on the web can link to documents on your web site, anyone can link to your data. And that linking has no effect on what it links to, other than providing new paths of access. Not only can others link to your data, you can link to your own data. This means that you can build up your data incrementally without having to modify your data structure or the systems that manage that data. No more waiting two years to add a code to our record format, then having to coordinate implementation on a broad basis so that all systems are in sync. Systems can ignore “new” data until they are ready to make use of it.
Linked data allows some communities or some segments of communities to add more data where they need to, and still remain compatible with the larger data sharing activity. One can add new controlled lists or expand general lists to express greater detail in some areas. Others don’t have to adjust to that if they don’t need that detail. This flexibility also extends to internationalization. Because linked data uses identifiers instead of names or terms, those identifiers can be presented to natural language users (i.e. humans) in any language you wish. RDA elements and controlled lists in linked data format are already being translated by interested libraries in Europe. And if there are particular needs in one country or region, those needs can be met by extending the metadata.
We only need to be the same in some areas to achieve linking. In the past, when we exchanged records, the record itself had to be a known and controlled unit. With linking, any part of the data can be compatible, but not all of it has to be. You can exchange or link to portions of someone else’s data, say to an author and title in a bibliography. Linked data can be as simple or complex as you wish it to be.
There are Semantic Web purists who have a fairly rigid notion of linked data, and admittedly the way they express their concept of linked data is abstract and obscure. I also think that the Semantic Web ideal of highly atomized, pure data is unrealistic. That doesn’t negate the value of linked data. It just means that we need to take a flexible approach to the concept if we are to bring our real-world complexity and messiness into the linked data realm.
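The points above about identifiers, labels in any language, and non-disruptive extension can be sketched as follows (identifiers and labels invented; the RDF machinery is reduced to plain subject–predicate–object tuples):

```python
# A toy triple store. The work is identified by an opaque identifier,
# not by any one language's string for it.
triples = {
    ("ex:work/42", "ex:title", "Moby Dick"),
    ("ex:work/42", "rdfs:label@en", "Moby Dick"),
    ("ex:work/42", "rdfs:label@fr", "Moby Dick ou le Cachalot"),
}

# Another community extends the graph with a predicate of its own;
# existing consumers simply ignore predicates they do not know.
triples.add(("ex:work/42", "ex:regionalNote", "local detail"))

def labels(graph, subject, lang):
    # Present the same identifier to readers in the requested language.
    return [o for s, p, o in graph
            if s == subject and p == f"rdfs:label@{lang}"]

print(labels(triples, "ex:work/42", "fr"))  # ['Moby Dick ou le Cachalot']
```

Nothing in the original three triples had to change when the fourth was added, which is the “expansion in a non-disruptive way” the passage describes.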