<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
		>
<channel>
	<title>Comments on: The web is a knowledge repository&#8230;not!</title>
	<atom:link href="http://www.mihswat.com/2008/10/10/the-web-is-a-knowledge-repositorynot/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.mihswat.com/2008/10/10/the-web-is-a-knowledge-repositorynot/</link>
	<description>MIH SWAT - the official blog of MIH's Strategic Worldwide Applications and Technology Team.</description>
	<lastBuildDate>Wed, 01 Sep 2010 08:37:01 +0000</lastBuildDate>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.0.1</generator>
	<item>
		<title>By: G-J</title>
		<link>http://www.mihswat.com/2008/10/10/the-web-is-a-knowledge-repositorynot/comment-page-1/#comment-41</link>
		<dc:creator>G-J</dc:creator>
		<pubDate>Sat, 11 Oct 2008 09:09:17 +0000</pubDate>
		<guid isPermaLink="false">http://www.mihswat.com/?p=34#comment-41</guid>
		<description>I agree completely: we tend to mistake pretty raw data for information on the web. The few places where we are likely to find subjectively useful information (Wikipedia, news sites, Knol) have heavy human involvement (and the richer sources of information tend to have more continuous human filtering and tweaking, e.g. the wikis).

It&#039;s interesting to compare this to another definition of information, from Claude Shannon&#039;s information theory. This is a mathematical theory of information in which the information content of a datum is inversely related to its probability of occurrence (formally, it is the negative logarithm of that probability). Shannon (and most communications engineers after him) used this definition in closed data sets, where we know beforehand what data is likely and what data is improbable. For example, if we were to design a system that efficiently encodes (&quot;compresses&quot;) English texts, the letter &quot;E&quot; carries little information, because it occurs about 13% of the time. The letter &quot;Z&quot; is rare (0.08%), and therefore its actual occurrence carries substantial information. This is the basis on which most modern compression and error-correction systems are built.

But the question is whether such a formal definition of information breaks down when we talk about knowledge as information. In a sense, yes, because the subjective probability of a datum differs from person to person. High-information data would be data that an individual finds surprising or new; low-information data would be obvious or common-sense. However, Shannon&#039;s probability-based definition could still be a useful way of thinking about subjective information content. From this point of view, information content is inversely related to the probability that data is already accommodated in an individual&#039;s knowledge framework. Data that changes this framework, that needs to be assimilated in some way, has high information content. Interestingly, data with too high an information content, in other words data that differs too much from an individual&#039;s framework, may be rejected outright, so that no useful information is conveyed.

So does this give us any indication of how to allow users to gather more information from the web, with less knowledge-brokering by other humans? I think so, yes, but it points towards two difficult problems.

The first one stems from the subjectivity of information, and can only be solved if our information services can model a user&#039;s knowledge framework in some way or other. In other words, user profiling must develop beyond simple clustering of favourite books and movies to a deeper cognitive model. Humans are good at this: good teachers and lecturers are those with a knack for developing &quot;thought models&quot; of their students and then offering data that maximises the transfer of information. Software systems are still pretty bad at this (but one of the better ones was designed by a psychologist, not an engineer, see http://snipurl.com/49jzu).

The second difficult problem is to extract semantic information, meaning, from the data cluttering the web. If you want to know whether data would carry high information for a user, you need a knowledge representation, and a way of comparing the semantic content of a page or a document or an idea to a user&#039;s knowledge profile. Semantic analysis still has a long way to go before this is possible.

But although the general problem is difficult to solve, I think an information-theory approach is already feasible in limited-domain applications. For example, if a user&#039;s knowledge framework can be estimated from cues such as observed interests and observed interaction with existing data within a specific domain (e.g. a news site, or an online encyclopedia), and if a knowledge ontology can be created for that domain, and available data clustered within that ontology, it might be possible to build some very useful applications. The difference from many current systems is that we shouldn&#039;t continue to feed a user data already within his framework (many recommendation systems seem to hammer me continuously with slight permutations of data or products that I already have), but data that differs enough from the existing framework to be interesting.</description>
		<content:encoded><![CDATA[<p>I agree completely: we tend to mistake pretty raw data for information on the web. The few places where we are likely to find subjectively useful information (Wikipedia, news sites, Knol) have heavy human involvement (and the richer sources of information tend to have more continuous human filtering and tweaking, e.g. the wikis).</p>
<p>It&#8217;s interesting to compare this to another definition of information, from Claude Shannon&#8217;s information theory. This is a mathematical theory of information in which the information content of a datum is inversely related to its probability of occurrence (formally, it is the negative logarithm of that probability). Shannon (and most communications engineers after him) used this definition in closed data sets, where we know beforehand what data is likely and what data is improbable. For example, if we were to design a system that efficiently encodes (&#8220;compresses&#8221;) English texts, the letter &#8220;E&#8221; carries little information, because it occurs about 13% of the time. The letter &#8220;Z&#8221; is rare (0.08%), and therefore its actual occurrence carries substantial information. This is the basis on which most modern compression and error-correction systems are built.</p>
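<p>To make the letter example concrete, here is a minimal sketch of Shannon&#8217;s measure, which is the negative base-2 logarithm of a symbol&#8217;s probability. The function name <code>self_information</code> is just an illustrative choice, and the two probabilities are the rough figures quoted above, not measured values.</p>

```python
import math


def self_information(probability: float) -> float:
    """Return the Shannon self-information of an event, in bits: -log2(p)."""
    if not 0.0 < probability <= 1.0:
        raise ValueError("probability must be in the interval (0, 1]")
    return -math.log2(probability)


# Approximate letter frequencies taken from the comment (assumed, not measured here).
letter_probability = {"E": 0.13, "Z": 0.0008}

for letter, p in letter_probability.items():
    print(f"{letter}: {self_information(p):.2f} bits")
```

<p>With these figures &#8220;E&#8221; comes out at roughly 3 bits and &#8220;Z&#8221; at roughly 10 bits, which is why an efficient code would spend fewer bits on the common letter than on the rare one.</p>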
<p>But the question is whether such a formal definition of information breaks down when we talk about knowledge as information. In a sense, yes, because the subjective probability of a datum differs from person to person. High-information data would be data that an individual finds surprising or new; low-information data would be obvious or common-sense. However, Shannon&#8217;s probability-based definition could still be a useful way of thinking about subjective information content. From this point of view, information content is inversely related to the probability that data is already accommodated in an individual&#8217;s knowledge framework. Data that changes this framework, that needs to be assimilated in some way, has high information content. Interestingly, data with too high an information content, in other words data that differs too much from an individual&#8217;s framework, may be rejected outright, so that no useful information is conveyed.</p>
<p>So does this give us any indication of how to allow users to gather more information from the web, with less knowledge-brokering by other humans? I think so, yes, but it points towards two difficult problems.</p>
<p>The first one stems from the subjectivity of information, and can only be solved if our information services can model a user&#8217;s knowledge framework in some way or other. In other words, user profiling must develop beyond simple clustering of favourite books and movies to a deeper cognitive model. Humans are good at this: good teachers and lecturers are those with a knack for developing &#8220;thought models&#8221; of their students and then offering data that maximises the transfer of information. Software systems are still pretty bad at this (but one of the better ones was designed by a psychologist, not an engineer, see <a href="http://snipurl.com/49jzu" rel="nofollow">http://snipurl.com/49jzu</a>).</p>
<p>The second difficult problem is to extract semantic information, meaning, from the data cluttering the web. If you want to know whether data would carry high information for a user, you need a knowledge representation, and a way of comparing the semantic content of a page or a document or an idea to a user&#8217;s knowledge profile. Semantic analysis still has a long way to go before this is possible.</p>
<p>But although the general problem is difficult to solve, I think an information-theory approach is already feasible in limited-domain applications. For example, if a user&#8217;s knowledge framework can be estimated from cues such as observed interests and observed interaction with existing data within a specific domain (e.g. a news site, or an online encyclopedia), and if a knowledge ontology can be created for that domain, and available data clustered within that ontology, it might be possible to build some very useful applications. The difference from many current systems is that we shouldn&#8217;t continue to feed a user data already within his framework (many recommendation systems seem to hammer me continuously with slight permutations of data or products that I already have), but data that differs enough from the existing framework to be interesting.</p>
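<p>As a rough illustration of that last idea, the sketch below scores items by how much of their content falls outside what a user already knows, and keeps only those in a middle band: new enough to be interesting, not so foreign that they would be rejected. Everything here is an assumption made for the example: the representation of a knowledge framework as a plain set of concept labels, the names <code>novelty</code> and <code>recommend</code>, the band limits 0.2 and 0.7, and the sample items. A real system would need the ontology and user model discussed above.</p>

```python
def novelty(item_concepts: set[str], known_concepts: set[str]) -> float:
    """Fraction of an item's concepts that are absent from the user's framework."""
    if not item_concepts:
        return 0.0
    return len(item_concepts - known_concepts) / len(item_concepts)


def recommend(items: dict[str, set[str]], known_concepts: set[str],
              low: float = 0.2, high: float = 0.7) -> list[str]:
    """Return titles whose novelty lies in [low, high], most novel first.

    The band limits are hypothetical tuning values, not taken from the comment.
    """
    scored = {title: novelty(concepts, known_concepts)
              for title, concepts in items.items()}
    in_band = [title for title, score in scored.items() if low <= score <= high]
    return sorted(in_band, key=lambda title: scored[title], reverse=True)


# Hypothetical user framework and catalogue, for illustration only.
known = {"compression", "entropy", "coding"}
items = {
    "Huffman coding primer": {"compression", "coding", "entropy"},
    "Entropy in language models": {"entropy", "coding", "language models"},
    "Medieval falconry": {"falconry", "hawks", "history"},
}

print(recommend(items, known))
```

<p>Here the primer is dropped because it adds nothing new, the falconry piece is dropped because it shares nothing with the user&#8217;s framework, and only the item with a partial overlap is kept.</p>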
]]></content:encoded>
	</item>
</channel>
</rss>
