Knowledge Engines - A Formal Proposition



Search engine technologies have been discussed to death and developed
endlessly over the past decade. However, such engines have no so-called
"thirst for knowledge", but rather a thirst for text. We continue to live
in an age where the best results for a query are produced from an input
comprising keywords. The outcome, rather than answers or self-tailored
content, is merely a linear collection of pages whose static content
resembles the keywords. There is, nonetheless, no way to guarantee that
such pages will provide the desired information, or that the information
they provide is reliable.

Iuron is set to become a collection of tools for knowledge engines, which
are intended to crawl the World Wide Web. The aim is to create a semantic
entity that captures facts from a large number of pages, thereby providing
an intelligent front-end for user search. Results are generated 'on the
fly' based on acquired knowledge and are solely intended to serve
individual users.


Let us think of the Internet as a collection of complex, inter-related
information. More cohesively, it holds an immense number of hypotheses and
thus can contain valid, consistent knowledge. Although we can process
(scan) all the information, higher-level knowledge, which is derived from
collections of pages, is still missing. There is enough knowledge across
the World Wide Web to answer more or less any question, assuming it is not
subjective. All that is done at present is word indexing with the notion
of word proximity.
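
To make the contrast concrete, here is a toy Python sketch of what keyword
indexing with proximity scoring amounts to. Every name and document in it
is invented for illustration; it shows the general technique, not the code
of any particular engine.

# Toy inverted index with a crude word-proximity score -- illustrative only.
from collections import defaultdict

index = defaultdict(list)          # word -> list of (page_id, position)

def add_page(page_id, text):
    for pos, word in enumerate(text.lower().split()):
        index[word].append((page_id, pos))

def proximity_score(page_id, words):
    # Score a page by how close together the query words appear.
    positions = [p for w in words for (pid, p) in index[w] if pid == page_id]
    if len(positions) < len(words):
        return 0.0
    return 1.0 / (1 + max(positions) - min(positions))

add_page("a", "search engines index keywords and measure word proximity")
add_page("b", "keywords appear here and then much later the word proximity is mentioned")
query = ["keywords", "proximity"]
pages = {pid for w in query for (pid, _) in index[w]}
print(sorted(pages, key=lambda p: -proximity_score(p, query)))

Note that nothing in this scheme captures what a page means; it only
rewards pages whose text happens to contain the query words close together.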

Let us face the fact that, among the more popular uses of search engines,
are searches for commercial companies which provide products or services.
Results returned by the engines sometimes correspond to the most valid and
relevant authority for a given niche. This may be fine for insight into
the magnitude and breadth of companies (or their Web sites), but it
equally often misleads the user.

Search engines at present fail to extend beyond a potentially morbid state
of "dominance prevails". Rather than providing users with the most
reasonable answer and/or a reference to a site, an engine provides a Web
link to whatever is most cited, typically due to fraudulent practices or
subjective search engine optimisation.

All in all, search engines at present encourage link-related spam and
content-related spam. In worse scenarios, their backlinks-based algorithms
lead to a rise in sponsored listings, whereas our natural incentive is to
prefer what would "work best for us", not what got recommended by
automated tools. These tools, which work at a shallow level without
understanding, opt to prioritise large corporations with money to spend on
good listings and inbound links.

Iuron is a project that addresses the issues above. First and foremost, it
converts the vast amount of information on the World Wide Web into facts.
Moreover, it serves as an impartial source of answers and is not highly
susceptible to deceit, as it can discern true from false.


There are a variety of plausible ideas, which have been expressed at some
depth in the manifestation document alongside their pitfalls. To name one
of them briefly, pages should be obtained from the World Wide Web and then
reduced to a set of facts. Facts will be assigned varying weights
depending on credibility factors. Frequently-repeated facts will be
encouraged, while falsified facts are discouraged or rejected altogether.
First-order logic serves as the holy grail by which a sequence of words
(elements) becomes a set of arguments with associated semantics.
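
As a rough illustration of the weighting idea, the following Python sketch
stores facts as first-order-style triples and lets repetition and source
credibility raise or lower their weights. All names and numbers are
hypothetical; nothing here is taken from an actual Iuron code base.

# Hypothetical fact store: repeated facts are "encouraged", falsified
# facts are "discouraged" and eventually rejected.
from collections import defaultdict

facts = defaultdict(float)        # (predicate, subject, object) -> weight

def assert_fact(predicate, subject, obj, credibility):
    # Each sighting adds a contribution scaled by the credibility
    # (0.0 to 1.0) attributed to the reporting page.
    facts[(predicate, subject, obj)] += credibility

def contradict_fact(predicate, subject, obj, credibility):
    # A falsified fact loses weight and, at zero, is rejected outright.
    key = (predicate, subject, obj)
    facts[key] -= credibility
    if facts[key] <= 0:
        del facts[key]

def answer(predicate, subject):
    # Return the best-supported object for a (predicate, subject) query.
    candidates = [(weight, obj) for (p, s, obj), weight in facts.items()
                  if p == predicate and s == subject]
    return max(candidates)[1] if candidates else None

assert_fact("capital_of", "France", "Paris", 0.9)   # e.g. an encyclopaedia
assert_fact("capital_of", "France", "Paris", 0.4)   # repeated elsewhere
assert_fact("capital_of", "France", "Lyon", 0.1)    # a dubious page
print(answer("capital_of", "France"))               # -> Paris

The hard part, of course, is the step this sketch skips: turning free text
into the (predicate, subject, object) triples in the first place.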


The fundamental approach to tackling the problem is not overly complicated.
The goal is certainly feasible; the resources to make it practical are the
primary barrier.

Since Iuron is an Open Source project, assemblage and construction of the
libraries would be rapid, making use of existing projects that fall under
the General Public Licence (GPL). In return, Iuron will provide a
potentially distributed environment, wherein any idle computer across the
world can assist crawling and report back to a main knowledge repository.
Think of it as a public-driven reciprocal effort to process and then
centralise human knowledge.
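
To give a flavour of that distributed arrangement, here is a hypothetical
Python sketch of a volunteer machine asking a central repository for work
and reporting candidate facts back. The endpoint URL and the JSON layout
are invented purely for illustration; no such API is specified anywhere in
the proposal.

# Sketch of a volunteer crawl worker talking to a central repository.
import json
import urllib.request

REPOSITORY = "http://iuron.example.org/api"      # hypothetical endpoint

def fetch_work_unit():
    # Ask the repository which pages this volunteer should crawl next.
    with urllib.request.urlopen(REPOSITORY + "/work") as response:
        return json.load(response)               # e.g. {"urls": [...]}

def report_facts(candidate_facts):
    # Send extracted (predicate, subject, object, confidence) tuples home.
    payload = json.dumps({"facts": candidate_facts}).encode("utf-8")
    request = urllib.request.Request(
        REPOSITORY + "/facts", data=payload,
        headers={"Content-Type": "application/json"})
    urllib.request.urlopen(request)

if __name__ == "__main__":
    unit = fetch_work_unit()
    extracted = []                               # filled by a real extractor
    for url in unit["urls"]:
        # ... fetch url, reduce the page to facts, append to extracted ...
        pass
    report_facts(extracted)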

Re: Knowledge Engines - A Formal Proposition

On Tue, 18 Oct 2005 13:14:09 +0100, Roy Schestowitz wrote in "Knowledge
Engines - A Formal Proposition":


I read on the BBC website yesterday...

"If consumers see a perceptible quality difference [with rival search
engines], they will disappear," admits Mr Arora.

We presume he means that, if the search results are not relevant to what
someone is searching for, they will look elsewhere.


At this moment in time, the Majestic12 project (which I believe is also
open source) is doing this, using distributed computing power to crawl the
web with a C#-based spider. It seemed to devour a few thousand pages on
one of my sites pretty damn quickly.

How do you intend to split the computing workload on this? That is, do you
intend the spider simply to crawl large numbers of sites for "all" data
and then let the user interface determine fact applicability, or do you
want the spider to extract facts?

I'm interested to know how you intend to discern a fact from a web page.



Re: Knowledge Engines - A Formal Proposition

__/ [Darren Tipton] on Sunday 23 October 2005 10:49 \__


Someone in this group has recently pointed out better relevancy in Yahoo.
I personally disagree, but fingers point in different directions, which
raises one's brow.


This must be the reason why the majority of people opt for Google despite the
default homepage, which tends to be


I remembered to acknowledge Majestic12 yesterday and I suspect it was you
who pointed out the site in the first place. There should be no "lust for
images", at least not initially. Having mentioned images, I do my research
in the field of computer vision, so there might be provision for image
analysis, classification and labelling too. Machine learning is quite
well-developed in that respect. Again, crawling should never rely on
captions, as these can be intentionally deceiving.

I once had this idea of allowing users to describe an image that they seek
and then fetch the most relevant images from the Web.

A computer will have a reliable understanding of image contents, an
understanding that surpasses the human eye and mind. Want a photo in which
a Labrador is drinking water from a fountain on a sunny day? Want a
descriptive verbal interpretation of a given distorted image? This will
probably be practical only in the distant future. It is a machine
learning/pattern recognition task taken to extremity.


If you allow humans to intervene, the process becomes labour-intensive and
subjective. It's better to make the entire framework autonomous and
self-maintaining. Lies and mistakes, however, are an issue, especially
urban legends that tend to repeat themselves.

As for load management, one can always set hit intervals. The need to crawl
pages time after time is only crucial when you seek up-to-the-minute facts
that can be encouraged instantly. Thus, you can have some analysis of
momentary trends... much like tag clouds.
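
A minimal sketch of such hit intervals, assuming per-host timestamps and
two arbitrary re-crawl periods, might look like this in Python (the hosts
and intervals are examples, nothing more):

# Per-host hit intervals: most hosts weekly, trend-sensitive hosts hourly.
import time

DEFAULT_INTERVAL = 7 * 24 * 3600        # most pages: once a week is plenty
FAST_INTERVAL = 3600                    # trend-sensitive sources: hourly

last_hit = {}                           # host -> timestamp of last crawl

def due_for_crawl(host, fast=False):
    interval = FAST_INTERVAL if fast else DEFAULT_INTERVAL
    return time.time() - last_hit.get(host, 0) >= interval

def record_hit(host):
    last_hit[host] = time.time()

if due_for_crawl("en.wikipedia.org"):
    record_hit("en.wikipedia.org")      # the actual crawl would happen here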


The sources can be selective at the start (e.g. Wikipedia) and the crawlers
can then extend with caution. There are words that characterise factual
science, for example, and help tell it apart from a 'breakfast and lunch
blog'.

I would toss a number 'off the cuff' and say that the proportion of pages
which are publicly-available spam is worrying. Ping traffic is possibly 80%
spam, so I believe the same might hold for content spam. Much of it just
doesn't get crawled. Spammers can produce 100 times the number of pages of
genuine sites if they put their minds to it. The secret lies in filtering,
which in itself is a machine learning task.
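
For the sake of illustration, a filtering step of that sort could start as
small as a naive Bayes classifier over page text. The Python snippet below
uses made-up training snippets and shows the general shape of the task,
not a production filter.

# Toy naive Bayes filter: "fact"-style text versus spam-style text.
import math
from collections import Counter, defaultdict

training = [
    ("the experiment measured the boiling point of water at sea level", "fact"),
    ("the study was published in a peer reviewed journal in 2004", "fact"),
    ("buy cheap pills now limited offer click here", "spam"),
    ("win money fast guaranteed no risk click now", "spam"),
]

word_counts = defaultdict(Counter)       # label -> word -> count
label_counts = Counter()

for text, label in training:
    label_counts[label] += 1
    word_counts[label].update(text.split())

def classify(text):
    vocabulary = {w for counts in word_counts.values() for w in counts}
    best_label, best_score = None, float("-inf")
    for label in label_counts:
        # log prior + sum of log likelihoods with add-one smoothing
        score = math.log(label_counts[label] / sum(label_counts.values()))
        total = sum(word_counts[label].values())
        for word in text.split():
            score += math.log((word_counts[label][word] + 1)
                              / (total + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

print(classify("click here for a guaranteed offer"))          # likely "spam"
print(classify("the journal published the measured results")) # likely "fact"

A real filter would need a far larger, curated corpus and better features,
but the point stands: separating genuine content from spam is a learning
problem, not a lookup problem.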

