Knowledge Engines - A Formal Proposition
Posted by Roy Schestowitz on October 18, 2005, 5:14 pm
Search engine technologies have been discussed to death and developed
endlessly over the past decade. However, such engines have no so-called
"thirst for knowledge", but rather a thirst for text. We continue to live
in an age where the best results for a query are produced given an input
comprising keywords. The outcome, rather than answers or self-tailored
content, is merely a linear collection of pages whose static content
resembles the keywords. There is no way to guarantee, nonetheless, that
such pages will provide the desired information, or that the information
they provide is accurate.

Iuron is set to become a collection of tools for knowledge engines, which
are intended to crawl the World Wide Web. The aim is to create a semantic
entity that captures facts from a large number of pages, thereby providing
an intelligent front-end for user search. Results are generated 'on the
fly' based on acquired knowledge and are intended solely to serve
individual users.

Let us think of the Internet as a collection of complex, inter-related
information. Taken together, it holds an immense number of hypotheses and
thus can contain valid, consistent knowledge. Although we can process
(scan) all the information, higher-level knowledge, which is derived from
a collection of pages, is still missing. There is enough knowledge across
the World Wide Web to answer more or less any question, assuming it is not
subjective. All that is done at present is word indexing with some notion
of word proximity.
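To make concrete what "word indexing with some notion of word proximity"
amounts to, here is a minimal Python sketch (not part of Iuron; the pages
and the scoring rule are invented for illustration): an inverted index
records term positions, and a query is ranked purely by how close its
keywords sit on a page, with no grasp of meaning.

from collections import defaultdict

def build_index(pages):
    """Map each term to (page_id, position) pairs."""
    index = defaultdict(list)
    for page_id, text in pages.items():
        for pos, term in enumerate(text.lower().split()):
            index[term].append((page_id, pos))
    return index

def proximity_score(index, query, page_id):
    """Score by the span covering the first occurrence of each query
    term on the page; a tighter span means a higher score."""
    positions = []
    for term in query.lower().split():
        hits = [p for pid, p in index.get(term, []) if pid == page_id]
        if not hits:
            return 0.0
        positions.append(min(hits))
    span = max(positions) - min(positions) + 1
    return 1.0 / span

pages = {
    "a": "the labrador drank water from the fountain",
    "b": "fountain pens and labrador retrievers for sale",
}
idx = build_index(pages)
for pid in pages:
    print(pid, proximity_score(idx, "labrador fountain", pid))

Note that the second page, a sales listing, outranks the first simply
because its keywords happen to sit closer together, which is exactly the
kind of shallow matching criticised below.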
Let us face the fact that among the more popular uses of search engines are
pursuits of commercial companies, which provide products or services.
Results that get returned by the engines sometimes correspond to the most
valid and relevant authority for a given niche. This may be fine for insight
into the magnitude and breadth of companies (or their Web sites), but it
equally often misleads the user.

Search engines at present fail to extend beyond a potentially morbid state
of "dominance prevails". Rather than providing users with the most
reasonable answer and/or a reference to a site, an engine provides a Web
link to whatever is most cited, typically due to fraudulent practices or
subjective search engine optimisations.

All in all, search engines at present encourage link-related spam and
content-related spam. In worse scenarios, their backlinks-based algorithms
lead to a rise in sponsored listings, whereas our natural incentive is to
prefer what would "work best for us", not what got recommended by automated
tools. These tools, which work at a shallow level without understanding,
opt to prioritise large corporations with money to spend on good listings
and inbound links.

Iuron is a project that addresses the issues above. First and foremost, it
converts the vast amount of information on the World Wide Web into facts.
Moreover, it serves as an impartial source for answers and is not highly
susceptible to deceit, as it can discern true from false.

There are a variety of plausible ideas, which have been expressed at some
depth in the manifestation document alongside their pitfalls. To name one
of them briefly, pages should be obtained from the World Wide Web and then
reduced to a set of facts. Facts will be assigned varying weights depending
on credibility factors. Frequently-repeated facts will be encouraged while
falsified facts are discouraged or altogether rejected. First-order logic
serves as the holy grail by which a sequence of words (elements) becomes a
set of arguments with associated semantics.
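As a rough illustration of the weighting idea described above, the sketch
below (Python; the triple representation, the credibility values and the
acceptance threshold are assumptions made for this example, not part of the
Iuron design) accumulates weight for facts repeated by credible sources and
lets contradicting evidence push a fact below the acceptance threshold.

from collections import defaultdict

# A "fact" is taken here to be a (subject, predicate, object) triple.
fact_weights = defaultdict(float)

def add_fact(fact, source_credibility):
    """Repeated assertions accumulate weight, scaled by source credibility."""
    fact_weights[fact] += source_credibility

def refute_fact(fact, source_credibility):
    """Contradicting evidence reduces the weight; strongly falsified
    facts can later be dropped altogether."""
    fact_weights[fact] -= source_credibility

def accepted_facts(threshold=1.0):
    """Keep only facts whose accumulated weight clears the threshold."""
    return {f: w for f, w in fact_weights.items() if w >= threshold}

# Example: the same fact seen on two reasonably credible pages survives,
# while a contradicted one does not.
add_fact(("water", "boils_at", "100C"), 0.8)
add_fact(("water", "boils_at", "100C"), 0.6)
add_fact(("moon", "made_of", "cheese"), 0.2)
refute_fact(("moon", "made_of", "cheese"), 0.9)
print(accepted_facts())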
The fundamental approach to tackling the problem is not overly complicated.
The goal is certainly feasible; the resources to make it practical are the
primary barrier.

Since Iuron is an Open Source project, assemblage and construction of the
libraries would be rapid, making use of existing projects that fall under
the General Public Licence (GPL). In return, Iuron will provide a
potentially distributed environment, wherein any idle computer across the
world can assist crawling and report back to a main knowledge repository.
Think of it as a public-driven reciprocal effort to process and then
centralise human knowledge.
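A hypothetical sketch of how an idle machine might take part in such a
distributed effort: it fetches the pages assigned to it, runs whatever fact
extraction is supplied, and posts the results to a central repository. The
endpoint URL and the JSON payload format are invented for illustration and
are not a specified Iuron interface.

import json
import urllib.request

REPOSITORY_URL = "http://example.org/iuron/submit"  # hypothetical endpoint

def crawl(url):
    """Fetch a page and return its raw text."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        return resp.read().decode("utf-8", errors="replace")

def report(url, facts):
    """Send extracted facts back to the main knowledge repository."""
    payload = json.dumps({"source": url, "facts": facts}).encode("utf-8")
    req = urllib.request.Request(
        REPOSITORY_URL, data=payload,
        headers={"Content-Type": "application/json"})
    urllib.request.urlopen(req, timeout=10)

def worker(urls, extract_facts):
    """Process a batch of URLs assigned to this idle machine."""
    for url in urls:
        try:
            text = crawl(url)
            report(url, extract_facts(text))
        except Exception as exc:
            print(f"skipping {url}: {exc}")

Whether extraction happens on the worker, as here, or centrally is an open
design choice, one that the reply below also raises.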
Re: Knowledge Engines - A Formal Proposition
I read on the BBC website yesterday...

"If consumers see a perceptible quality difference [with rival search
engines], they will disappear," admits Mr Arora.

We presume he means: if the search results are not relevant to what someone
is searching for... they will look elsewhere.
At this moment in time, the Majestic12 project is doing this (which I
believe is also open source): using distributed computing power to crawl
the web with a C#-based spider. It seemed to devour a few thousand pages on
one of my sites pretty damn quickly.

How do you intend to split the computing workload on this? Do you intend
the spider simply to crawl large numbers of sites for "all" data and then
let the user interface determine fact applicability? Or do you want the
spider to extract facts?
I'm interested to know how you intend to discern a fact from a web page.

There is a fruit in the reply email address.
Google Tools: http://www.tippy.org/pr_compare.php
Better Driving: http://www.advanced-driving.co.uk/bb
Re: Knowledge Engines - A Formal Proposition
Someone in this group has recently pointed out better relevancy in Yahoo. I
personally disagree, but fingers point in different directions, which raises
doubts. This must be the reason why the majority of people opt for Google
despite the default homepage, which tends to be msn.com.
I remembered to acknowledge Majestic12 yesterday
< http://schestowitz.com/Weblog/archives/2005/10/22/collaborative-crawl/ >
and I suspect it was you who pointed out the site in the first place. There
should be no "lust for images", at least not initially. Having mentioned
images, I do my research in the field of computer vision, so there might be
provision for image analysis, classification and labelling too. Machine
learning is quite well-developed in that respect. Again, crawling should
never rely on captions as these can be intentionally deceiving.
I once had this idea of allowing users to describe an image that they seek
and then fetch the most relevant images from the Web.
A computer will have a reliable understanding of image contents: an
understanding that surpasses the human eye and mind. Want a photo in which
a Labrador is drinking water from a fountain on a sunny day? Want a
descriptive verbal interpretation of a given distorted image? This will
probably be practical in the distant future. It is a machine
learning/pattern recognition task taken to the extreme.
If you allow humans to intervene, the process becomes labour-intensive and
subjective. It's better to make the entire framework autonomous and
self-maintaining. Lies and mistakes, however, are an issue, especially
urban legends that tend to repeat themselves.
As for load management, one can always set hit intervals. The need to crawl
pages time after time is only crucial when you seek up-to-the-minute facts
that can be encouraged instantly. Thus, you can have some analysis of
momentary trends... much like tag clouds.
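The hit-interval idea can be sketched as a simple per-host politeness rule
in Python (the interval values are assumptions for illustration): a host is
only revisited once its interval has elapsed, and sources known to change
quickly can be given shorter intervals.

import time

last_visit = {}            # host -> timestamp of last crawl
revisit_interval = {       # seconds between visits; assumed values
    "en.wikipedia.org": 3600,
}
DEFAULT_INTERVAL = 86400   # most hosts: at most once a day

def may_crawl(host, now=None):
    """Return True and record the visit if the host's interval has passed."""
    now = now if now is not None else time.time()
    interval = revisit_interval.get(host, DEFAULT_INTERVAL)
    if now - last_visit.get(host, 0) >= interval:
        last_visit[host] = now
        return True
    return False

print(may_crawl("en.wikipedia.org"))  # True on the first call
print(may_crawl("en.wikipedia.org"))  # False until an hour has passed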
The sources can be selective at the start (e.g. Wikipedia) and the crawlers
then extend with caution. There are words that characterise factual
science, for example, and tell it apart from a 'breakfast and lunch blog'.
I would toss a number 'off the sleeve' and say that the proportion of pages
which are publicly-available spam is worrying. Ping traffic is possibly 80%
spam, so I believe the same might hold for content spam. Much of it just
doesn't get crawled. Spammers can produce 100 times the number of pages of
genuine sites if they put their minds to it. The secret lies in filtering,
which in itself is a machine learning task.
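As a miniature of filtering as a machine learning task, here is a naive
Bayes word model in Python, trained on an invented handful of hand-labelled
snippets; a real filter would need vastly more data and features, but the
principle is the same.

import math
from collections import Counter

spam_docs = ["cheap pills buy now limited offer",
             "win money fast click here now"]
ham_docs = ["the boiling point of water is 100 degrees",
            "labrador retrievers were bred as working dogs"]

def train(docs):
    """Count word occurrences for one class."""
    counts = Counter(w for d in docs for w in d.split())
    return counts, sum(counts.values())

spam_counts, spam_total = train(spam_docs)
ham_counts, ham_total = train(ham_docs)
vocab = set(spam_counts) | set(ham_counts)

def log_prob(text, counts, total):
    """Log-likelihood of the text under one class, with add-one smoothing."""
    return sum(math.log((counts[w] + 1) / (total + len(vocab)))
               for w in text.split())

def is_spam(text):
    """Classify by comparing likelihoods under the two word models."""
    return (log_prob(text, spam_counts, spam_total) >
            log_prob(text, ham_counts, ham_total))

print(is_spam("buy cheap pills now"))          # True
print(is_spam("water is 100 degrees"))         # False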