Google images -- what won't they index?

I don't mean content, I mean what technical things about the way an
image is presented on a page will prevent them from indexing them?
And I mean in Google Images, NOT the basic web search.

No money rides on this (for me, I mean; I'm sure some people have
business models that depend on google indexing their images), I'm just

My online images are mostly presented in thumbnail pages (static html;
generated from a script just once, not on-demand), with the thumbnails
linking to a script that produces a page with the full-size image and
other associated information (caption, tech info, navigation links to
walk through the gallery without returning to the thumbnail;
conventional stuff for a photo gallery).  

And the full-size images mostly don't get indexed.  In a few places
where I've put up a static page with inline images, they *do* get
indexed -- suggesting that Google is willing to index images on my
site.  And the logs show that google does spider me pretty regularly.

I've been poking slowly at my problem here for several years now.  I
suspect that google is unhappy with some of the headers my script for
the full-size image page produces.  I've been playing with those,
trying to get them as vanilla as possible.  In particular I've made
sure that I generate a reasonable "last-modified:", and I've put in a
"cache-control: public" (I have no reason to believe google pays
attention to that, but a number of cache strategies don't cache
dynamic pages unless explicitly told to, so it's the right thing for
me to do on these pages).  And that the script responds correctly to
an if-modified-since header (and in fact I've got in the log a recent
example of googlebot receiving a 304 response on the URL of one of
these pages, so googlebot does send if-modified-since).
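The header handling described above (Last-Modified, Cache-Control: public, and a 304 reply to If-Modified-Since) can be sketched roughly as follows. This is only a minimal Python illustration, not the actual picpage code: the function name `conditional_response` is hypothetical, and it assumes the script knows the page's modification time as a timezone-aware UTC datetime.

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime


def conditional_response(last_modified, if_modified_since=None):
    """Return (status, headers) for a page last changed at `last_modified`.

    `last_modified` is assumed to be an aware datetime in UTC.
    """
    # HTTP dates have one-second resolution, so drop microseconds first.
    last_modified = last_modified.replace(microsecond=0)
    headers = {
        "Last-Modified": format_datetime(last_modified, usegmt=True),
        "Cache-Control": "public",
    }
    if if_modified_since:
        try:
            since = parsedate_to_datetime(if_modified_since)
        except (TypeError, ValueError):
            since = None  # unparseable header: send the full page
        if since is not None and since.tzinfo is None:
            # Dates without a usable zone are treated as UTC here.
            since = since.replace(tzinfo=timezone.utc)
        if since is not None and last_modified <= since:
            return 304, headers  # client copy is still current
    return 200, headers


page_time = datetime(2005, 9, 17, 19, 0, 0, tzinfo=timezone.utc)
print(conditional_response(page_time, "Sat, 17 Sep 2005 19:00:00 GMT"))
```

A 304 carries no body, so the crawler keeps whatever copy it fetched earlier; the 200 branch is where the script would go on to generate the page.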

And google clearly indexes stuff in dynamic photo albums -- it's easy
to find examples.  I've looked at the headers that Gallery, for
example, generates, and I don't see anything "wrong" with mine, but
there are certainly differences.  

Has anybody worked through this issue and has a clear characterization
of what Google images (images; not the main google web search) will
and won't index?  And is willing to share?

There are some slight indications that my latest round of script changes
has, maybe, produced something they're willing to index, but it's only been
up a few days so no results are in the actual index yet.
RKBA: < <
Pics: < <
Dragaera/Steven Brust: <

Re: Google images -- what won't they index?

__/ [David Dyer-Bennet] on Saturday 17 September 2005 19:00 \__

You can use metadata to prevent crawling. You must ensure that crawlers have
no path via which they descend to images. Image files themselves have little
or no information associated with them (magic number et cetera?).

Thumbnails will often get indexed before the full-sized image equivalents. I
think that's rather intuitive. Just as a viewer sees thumbnails first, so
will a crawler (bot).

Wait until more crawling is completed. That would be my advice. The Google
Images index gets refreshed 2 or 3 times a year, but crawling continues
all year.

How big are the full-sized images? I have seen galleries with 4 MB JPEGs
that are barely compressed.

I suggest that you don't intervene until you get definite answers. You might
be throwing away valuable time, damaging your site in the process.

What is the nature of these scripts?

I use Gallery too and Google appears to index it properly. The mistake I
once made was that I changed album names (slugs). Even 4 months later, I
still get many 404's as a result. I don't think there is something
inherently bad with Gallery and its interaction with bots.

Here are a few observations:

-Google binds image descriptions (keywords), or vice versa, to the page title
and headers, probably using some word density tests and finding captions.
They also appear to be using the name (filename) of the image. I have no
evidence to suggest that the alt attribute gets used much, if at all, in
the assignment of keywords.

-Google Images prefers to return large and clear images to its
users. It does not return thumbnails too often. Moreover, it might be able
to detect the presence of smaller/larger versions and make a sensible choice,
removing duplicates in the process.

Experimentation like that needs to account for many more factors. You cannot
just isolate one. It's a high-dimensional problem with a great many
parameters (PageRank, algorithm changes and so forth). At best you end up
under a self-imposed illusion.

As I said before, it may take quite a few months for the index to get
updated. The last Google Images update was around 1-2 months ago. I can
clearly remember announcing it in

Hope it helps,


Roy S. Schestowitz      | Proprietary cripples communication  |    SuSE Linux    |     PGP-Key: 74572E8E
  1:10pm  up 24 days  1:24,  3 users,  load average: 0.31, 0.38, 0.72

Re: Google images -- what won't they index?

And since I *want* these indexed (well, not desperately; but would
prefer), I have avoided all these things that would block it.
robots.txt doesn't block those directories, no meta tags "no-index" or
whatever that one is (I generally just use robots.txt), etc.  
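One way to double-check that robots.txt really does leave the gallery open to the image crawler is to feed the file to a parser and ask. A minimal sketch using Python's standard `urllib.robotparser`; the robots.txt contents and the paths below are made up for illustration and are not taken from the site in question.

```python
from urllib.robotparser import RobotFileParser


def can_crawl(robots_txt, user_agent, path):
    """Report whether `user_agent` may fetch `path` under this robots.txt."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, path)


# Hypothetical robots.txt: one private directory, everything else open.
SAMPLE = """\
User-agent: *
Disallow: /private/
"""

print(can_crawl(SAMPLE, "Googlebot-Image", "/gal/picpage/macro?id=bee"))
print(can_crawl(SAMPLE, "Googlebot-Image", "/private/draft.jpg"))
```

This only covers robots.txt; a robots meta tag in the generated page would have to be checked separately in the HTML.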

Actually, image files can carry a lot of text information these
days -- EXIF and IPTC data in particular.
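To see whether a given JPEG carries such metadata at all, one can walk its header segments. The following is a rough standard-library Python sketch, not a full parser: it assumes EXIF data sits in an APP1 segment starting with "Exif" and that IPTC data sits inside a Photoshop-style APP13 segment, which is the usual layout, and it ignores rarer variations. The function name is hypothetical.

```python
import struct


def metadata_segments(data):
    """List the metadata blocks found before the image data of a JPEG."""
    if data[:2] != b"\xff\xd8":
        raise ValueError("not a JPEG (missing SOI marker)")
    found = []
    pos = 2
    while pos + 4 <= len(data):
        if data[pos] != 0xFF:
            break  # not at a marker; give up rather than guess
        marker = data[pos + 1]
        if marker == 0xDA:
            break  # start of scan: compressed image data follows
        # The 2-byte big-endian length counts itself but not the marker.
        (length,) = struct.unpack(">H", data[pos + 2:pos + 4])
        payload = data[pos + 4:pos + 2 + length]
        if marker == 0xE1 and payload.startswith(b"Exif\x00\x00"):
            found.append("EXIF")
        elif marker == 0xED and payload.startswith(b"Photoshop 3.0\x00"):
            found.append("IPTC/Photoshop")
        pos += 2 + length
    return found
```

Usage would be something like `metadata_segments(open("photo.jpg", "rb").read())`, where the filename is a placeholder. Whether any crawler actually reads these blocks is a separate question that this does not answer.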

This ongoing issue goes back many years for me.

Mostly under 60k, a very few up to 200k.  

But if I *don't* ever play with things, I can be sure things *won't*
ever improve.  As I say, I've been chipping away slowly at this for
the last several years.  Some of these galleries originated on the web
in their original form in 1994 or 1993.  

There's basically one script that's currently in question, which I
call "picpage" (it'll be useful, perhaps, to be able to refer to it
specifically by name if the discussion continues).  It's invoked when
a visitor clicks on a thumbnail on the index page for the gallery, and
it sends a page containing some navigation, some information about the
image from various sources, and the "big" (screen resolution) image.

For example, <
is the thumbnail page for a recent set of macro photos I took.  (That
page is static HTML sitting on the server; that HTML was generated by
a script when I created this gallery, but I don't think that's
relevant -- the HTML is generated by the perl CGI library, and looks
ordinary enough to me.)  

If you click on the "Bee shadow on morning glory" thumbnail, the URL
fetched is
< .
This invokes the picpage script ("gal" is configured as a mod_perl
directory), the additional path information specifies which directory,
and the id parameter specifies which image in that directory.  Picpage
generates and sends a page to display that one photo, with caption and
other information, and with navigation links, in particular to the
previous and next image in that gallery.
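The way such a URL splits into "which directory" and "which image" can be illustrated in a few lines of Python. The URL layout below (`/gal/picpage/<directory>?id=<image>`) and the example values are assumptions made for illustration only; the real URLs are not reproduced here, and the real script is mod_perl, not Python.

```python
from urllib.parse import parse_qs, urlsplit

SCRIPT_PREFIX = "/gal/picpage"  # hypothetical location of the script


def parse_picpage_url(url):
    """Split a picpage-style URL into (gallery directory, image id)."""
    parts = urlsplit(url)
    path = parts.path
    if path != SCRIPT_PREFIX and not path.startswith(SCRIPT_PREFIX + "/"):
        raise ValueError("not a picpage URL")
    # Everything after the script name is the extra path information.
    directory = path[len(SCRIPT_PREFIX):].strip("/")
    ids = parse_qs(parts.query).get("id", [])
    if len(ids) != 1:
        raise ValueError("expected exactly one id parameter")
    return directory, ids[0]


print(parse_picpage_url("http://example.org/gal/picpage/2005/macro?id=bee-shadow"))
```

From a crawler's point of view every distinct id value is a distinct page, which is why the response headers on each of those pages matter.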

Yes, it seems to me that it *doesn't* encourage caching.  When I hit
the browser back button from viewing an image to get to the Gallery
thumbnail page, it's clearly regenerating and resending the page.
This makes the performance suck, and loads down the server.  

(I've run Gallery for some users on my server, and it's certainly a
pig from the server end!  And it's much better than my scripts for
some people for some uses -- notably for ignorant users.  I'm not
trying to please anybody but myself on the gallery-maintainer end of
my own scripts, luckily!)

It's interesting that they appear to prefer my (few) inline galleries,
where half a dozen images are displayed on a single page, rather than
the picpage pages where each photo is alone on the page and both the
page title and a DIV directly above the image give a short
description.  In fact that's part of why I've been suspecting that the
problem is somehow a result of the dynamic serving of the pages,
rather than of the HTML content.

Yes, I'm vividly aware of that.  I don't *think* I have any definite
beliefs about what they do that are wrong -- but that's mostly because
I have so few really *definite* beliefs about what they do :-).

I know the turnaround time on experiments is long, and that it's not
always easy to even *tell* when the result of an experiment is now
visible (obviously, if the experiment changes the search results,
that's easy to detect, but many unsuccessful things one tries *don't*
change the results).

The information on how infrequently they update the image database in
particular is *immensely* useful, thanks very much!  I wasn't exactly
*assuming* they updated as frequently as the text search side, but I
didn't know it was that big a difference.

And the other observations at least help me believe I'm not off on a
completely blind alley or an insane quest; though they don't open up
any new areas for me to consider.

And that sounds like a newsgroup I need to cast an eye over, thanks!

Yes, greatly appreciated.

Re: Google images -- what won't they index?

A second followup, since things in Google images changed today!

I ran another query in google images today, and
suddenly a group of images, mostly from my father's memorial site,
which were never there before, are now there!  

Furthermore, the images all seem to be things that something calling
itself "Googlebot-Image" crawled *in the last 5 days*.  This last is
*probably* just coincidence, or else I got lucky at just barely being
included in the crawling before they updated their database.
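Checking which URLs "Googlebot-Image" fetched in the last few days is easy to script. A minimal Python sketch, assuming the server writes the Apache "combined" log format; the sample log lines, addresses and paths are invented for illustration and are not real log data.

```python
import re
from datetime import datetime, timedelta, timezone

# Apache "combined" format: host ident user [time] "request" status bytes
# "referer" "user-agent"
LOG_LINE = re.compile(
    r'^\S+ \S+ \S+ \[(?P<when>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<agent>[^"]*)"$'
)


def image_bot_hits(lines, now, days=5):
    """Return (time, path, status) for recent Googlebot-Image requests."""
    cutoff = now - timedelta(days=days)
    hits = []
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if not match or "Googlebot-Image" not in match["agent"]:
            continue
        when = datetime.strptime(match["when"], "%d/%b/%Y:%H:%M:%S %z")
        if when < cutoff:
            continue
        request = match["request"].split()
        path = request[1] if len(request) >= 2 else ""
        hits.append((when, path, int(match["status"])))
    return hits


# Invented sample lines, for illustration only.
SAMPLE = [
    '192.0.2.10 - - [16/Sep/2005:08:15:00 +0000] "GET /gal/picpage/macro?id=bee HTTP/1.1" 200 5120 "-" "Googlebot-Image/1.0"',
    '192.0.2.10 - - [01/Sep/2005:08:15:00 +0000] "GET /gal/picpage/macro?id=old HTTP/1.1" 200 5120 "-" "Googlebot-Image/1.0"',
    '192.0.2.11 - - [17/Sep/2005:09:00:00 +0000] "GET /index.html HTTP/1.1" 304 - "-" "Mozilla/5.0 (compatible; Googlebot/2.1)"',
]

for when, path, status in image_bot_hits(SAMPLE, datetime(2005, 9, 18, 12, 0, tzinfo=timezone.utc)):
    print(when.isoformat(), status, path)
```

Comparing that list against what shows up in an image search would show whether the match with recent crawling is coincidence or not.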

But anyway, the previous situation where *nothing* displayed via my
picpage script was in Google Images has changed.  I feel, at least,
less hopeless!