10/10/08

The web is a knowledge repository…not!

by Jacques van Niekerk

Does it appear redundant to say that the web is a knowledge repository? Apparently yes – we all know that the web is the fount of knowledge, if not wisdom. But this statement bears closer examination.

I want to take a closer look at the meaning of the term knowledge, and to discuss the relationship between knowledge, information and data. As “web-practitioners” we are information merchants more than anything else – we make information pretty, we tweak it and we sell it in the form of databases, websites, social networks etc. We really should understand the stuff we are dealing in.

To start – how do sociologists see information and knowledge? (I know – you are not a sociologist, you are a technologist – but consider: we are part of a discipline that deals with people. We need sociology!) Well, academic research suggests that sociologists tend to conflate knowledge and information: they draw very little distinction between the two concepts and use the terms interchangeably. What then is the distinction between the meanings of these terms? To give an answer, I have to go back yet another step – and take a look at data itself.

It is important to understand that we are immersed in data – and that we perceive only a very small part of what surrounds us. Data comes into existence because of changes in the real world. In order to make sense of data, we process the input and end up with information. The actual mechanisms we use to process the data are the subject matter of neuroscience and psychology – just note here that the process is subjective. The information that is extracted from data is different for every person. So what then is knowledge?

Knowledge is what the human being (or the data processing agent, if you want to be a little more generic) has learned from previous information extraction exercises. And knowledge is continuously modified by further information.

Now we have the whole picture – data is everywhere: we are immersed in data, some of which we can perceive, some of which we are completely unaware of. Information is what we extract from data, and it is subjective – my information differs from yours (although it CAN be the same). Finally – knowledge is what we have learned, and it is continuously shaped by the arrival of more information…all of which means that knowledge is subjective, too.

Can the web be a knowledge repository? This might not be the right view to take. The web is really a collection of data points – it is not information, it is not knowledge. The task of the web practitioner is to turn this stuff into information. We do this by creating applications that are windows on the web. I can hear the question already – where does the semantic web fit into all this? Think of the semantic web as part of the effort to turn data into information. The semantic web links data in ways that allow us to more easily process and integrate the data to obtain information.

Finally – where are the knowledge players? Right now, we human beings are it. Knowledge is in our heads, not in the web. As technologists we should assist people to turn data into information, thereby creating knowledge.

Web practitioners are first and foremost knowledge workers – proving that once again, people matter more than technology does.

For academic research, look up the work of Max H Boisot – a good starting point for the KM view on information processing.

One Response to “The web is a knowledge repository…not!”

  1. G-J Says:

    I agree completely: we tend to mistake pretty raw data for information on the web. The few places where we are likely to find subjectively useful information (Wikipedia, news sites, Knol) have heavy human involvement (and the richer sources of information tend to have more continuous human filtering and tweaking, e.g. the wikis).

    It’s interesting to compare this to another definition of information, from Claude Shannon’s information theory. This is a mathematical theory of information in which the information content of a datum falls as its probability of occurrence rises (formally, it is the negative logarithm of that probability). Shannon (and most communications engineers after him) used this definition in closed data sets, where we know beforehand what data is likely and what data is improbable. For example, if we were to design a system that efficiently encodes (“compresses”) English texts, the letter “E” carries little information, because it occurs about 13% of the time. The letter “Z” is rare (0.08%), and therefore its actual occurrence carries substantial information. This is the basis on which most modern compression and error-correction systems are built.
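
    To make the letter example concrete, here is a minimal sketch in Python. It assumes the standard definition of self-information, I(x) = -log2 p(x), measured in bits, and it takes the two letter frequencies quoted above at face value. The function name `self_information` is only an illustrative choice.

```python
import math


def self_information(probability: float) -> float:
    """Return the Shannon self-information, in bits, of an event
    that occurs with the given probability (0 < probability <= 1)."""
    if not 0.0 < probability <= 1.0:
        raise ValueError("probability must be in the interval (0, 1]")
    return -math.log2(probability)


# Letter frequencies as quoted in the comment above (approximate values).
letter_probability = {"E": 0.13, "Z": 0.0008}

for letter, p in letter_probability.items():
    print(f"{letter}: {self_information(p):.1f} bits")
```

    With these figures an “E” is worth roughly 3 bits and a “Z” roughly 10 bits, which is why an efficient code gives common letters short codewords and rare letters long ones.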

    But the question is whether such a formal definition of information breaks down when we talk about knowledge as information. In a sense, yes, because the subjective probability of a datum differs from person to person. High-information data would be data that an individual finds surprising or new; low-information data would be obvious or common-sense. However, Shannon’s probability-based definition could still be a useful way of thinking about subjective information content. From this point of view, information content is inversely related to the probability that data is already accommodated in an individual’s knowledge framework. Data that changes this framework, that needs to be assimilated in some way, has high information content. Interestingly, data with too high an information content, in other words data that differs too much from an individual’s framework, may be rejected outright, so that no useful information is conveyed.

    So does this give us any indication on how to allow users to gather more information from the web, with less knowledge-brokering by other humans? I think so, yes: but it points towards two difficult problems.

    The first one stems from the subjectivity of information, and can only be solved if our information services can model a user’s knowledge framework in some way. In other words, user profiling must develop beyond simple clustering of favourite books and movies into a deeper cognitive model. Humans are good at this: good teachers and lecturers are those with a knack for developing “thought models” of their students and then offering data that maximises the transfer of information. Software systems are still pretty bad at this (but one of the better ones was designed by a psychologist, not an engineer, see http://snipurl.com/49jzu).

    The second difficult problem is to extract semantic information, meaning, from the data cluttering the web. If you want to know whether data would carry high information for a user, you need a knowledge representation, and a way of comparing the semantic content of a page or a document or an idea to a user’s knowledge profile. Semantic analysis still has a long way to go before this is possible.

    But although the general problem is difficult to solve, I think an information-theory approach can be feasible already in limited-domain applications. For example, if a user’s knowledge framework can be estimated from cues such as observed interests and observed interaction with existing data within a specific domain (e.g. a news site, or an online encyclopedia), and if a knowledge ontology can be created for that domain, and available data clustered within that ontology, it might be possible to build some very useful applications. The difference from many current systems is that we shouldn’t continue to feed a user data already within his framework (many recommendation systems seem to hammer me continuously with slight permutations of data or products that I already have), but rather data that differs enough from the existing framework to be interesting.
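
    As a rough illustration of that last idea, the sketch below is one possible toy version, not a description of any existing system. It assumes the user’s knowledge framework can be reduced to a set of topic labels, that each item carries its own topic labels, and that “novelty” is simply the fraction of an item’s topics the user has not met before. The names `novelty` and `recommend`, the thresholds and the sample data are all hypothetical.

```python
def novelty(item_topics: set[str], known_topics: set[str]) -> float:
    """Fraction of the item's topics that are absent from the user's
    known topics: 0.0 means nothing new, 1.0 means entirely unfamiliar."""
    if not item_topics:
        return 0.0
    return len(item_topics - known_topics) / len(item_topics)


def recommend(
    items: dict[str, set[str]],
    known_topics: set[str],
    low: float = 0.25,   # assumed threshold: below this the item is "old news"
    high: float = 0.75,  # assumed threshold: above this the item is too alien
) -> list[str]:
    """Keep only items whose novelty lies inside the [low, high] band."""
    return [
        name
        for name, topics in items.items()
        if low <= novelty(topics, known_topics) <= high
    ]


# Hypothetical user profile and catalogue, for illustration only.
known = {"python", "web", "rest"}
catalogue = {
    "more_rest": {"python", "web", "rest"},  # novelty 0.0, already known
    "semantic_web": {"web", "rest", "rdf"},  # one new topic out of three
    "opera": {"verdi", "aria"},              # novelty 1.0, likely rejected
}

print(recommend(catalogue, known))
```

    Only the item in the middle band survives the filter. The hard parts described above, building the ontology and estimating the user’s framework, are exactly what this toy version assumes away.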
