Right, well, this is a long piece this time. I’ve been trying to get it out for a while, but here it is. First, I posted a couple of questions to Jeremie Miller recently, and the Qs and As are below. Further down is a mailing list email from Jer which includes many more details about Grub and its progress.
Q: What do you see happening in Grub’s near future?
A: The ability to get immediate feedback, upload URLs, download crawl snapshots, etc.; more usability in the functions; and a better protocol to make developing clients easier.
Q: Grub is currently being used to index the net but is anything else in the pipeline for the rest of the project (i.e. Atlas)?
A: Yeah, Grub is to be the best crawler, and for a social good: anyone can benefit from its results. As for quality indexing, I hope there are multiple projects we can start around that, some of it being natural language, some of it being good ranking, some of it being scaling across many computers, etc., but they all depend on a good source of data. That’s what Grub has to do really well first.
Q: What are you currently working on?
A: Figuring out how to get grub.org running in a more production mode, and exploring its source code. Also, some prototype source code for Atlas stuff, some testing tools, but that’s a few weeks away.
Grub Development post from Jer
A lot more info was released in a post to the Grub Development mailing list and is included below.
I’ve been meaning to send out the low-down on all the Grubbing going
on the past month or so, and some ideas for where it’s all going.
Feel free to ask if I don’t answer anything anyone might want to know.
First, most everyone should have noticed the global stats are
finally working, and with them up you can see the service still goes
down semi-frequently. We’ve got the entire thing “throttled down” as
far as it will go, and it’s still crawling millions of URLs daily and
filling up the 30GB partition it’s caged in for testing.
So, some things learned about the current Grub system:
* it’s not recursive (doesn’t automatically discover/inject new urls)
* it is capable of obeying robots when injected with them
* only grabs text/html right now
* uses a simple checksum to look for changes
* doesn’t track ETag or Last-Modified (pretty major flaws IMO)
* was over-engineered for modularity
* uses a rather obtuse SOAP encoding
* stores crawl results in its own, also obtuse, encoding
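Two of those items go together: the simple checksum only notices a change after the whole page has been re-downloaded, whereas tracking ETag/Last-Modified lets the server say "unchanged" with a 304 and no body at all. Here's a minimal sketch of what that conditional fetch looks like; this is illustrative code, not Grub's, and the stored etag/last_modified bookkeeping is assumed.

```python
import urllib.request
import urllib.error


def build_conditional_request(url, etag=None, last_modified=None):
    """Build a request that lets the server skip the body if unchanged.

    etag / last_modified are values saved from a previous crawl of this
    URL (hypothetical bookkeeping; the current crawler does not store
    them, which is the flaw noted above).
    """
    req = urllib.request.Request(url)
    if etag:
        req.add_header("If-None-Match", etag)
    if last_modified:
        req.add_header("If-Modified-Since", last_modified)
    return req


def fetch_if_changed(url, etag=None, last_modified=None):
    """Return (body, etag, last_modified), or None if the page is unchanged."""
    req = build_conditional_request(url, etag, last_modified)
    try:
        resp = urllib.request.urlopen(req)
    except urllib.error.HTTPError as e:
        if e.code == 304:
            return None  # 304 Not Modified: nothing was transferred
        raise
    body = resp.read()
    # Save these headers for the next crawl pass over this URL.
    return body, resp.headers.get("ETag"), resp.headers.get("Last-Modified")
```

With a checksum alone, every recrawl pays full bandwidth for every page; with conditional requests, unchanged pages cost one round trip and a handful of header bytes.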
Hmm, that’s enough pain to start with… so the very first goal was to get the crawl output in a more usable format, the Internet Archive ARC format (http://www.archive.org/web/researcher/ArcFileFormat.php ). That happened this week, and now the work-unit binary blobs are being converted into much more useful ARC files automatically, yay!
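For a sense of why ARC is "much more useful" than an opaque blob: each ARC v1 record is just a one-line plain-text header of five space-separated fields followed by the raw payload, so anyone can stream through a file with a few lines of code. The sketch below builds one such record; it follows the linked spec roughly and is not the actual conversion code.

```python
import datetime


def arc_record(url, ip, content_type, payload: bytes) -> bytes:
    """Build a single ARC v1 record.

    The header line holds five space-separated fields: URL, IP address,
    a 14-digit UTC timestamp, content type, and payload length in bytes.
    The payload (typically the raw HTTP response) follows, then a
    newline separating this record from the next.
    """
    stamp = datetime.datetime.now(datetime.timezone.utc).strftime("%Y%m%d%H%M%S")
    header = f"{url} {ip} {stamp} {content_type} {len(payload)}\n"
    return header.encode("ascii") + payload + b"\n"
```

Because the length field says exactly how many bytes follow the header, a reader can skip from record to record without parsing the payloads at all.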
The next step is to get a lot more URLs loaded; there are about a
million total right now, basically a random sample, and
we’re churning through those a few times a day. I have extracted over
16 million more URLs from a Wikipedia snapshot, and before they get
loaded they have to go through a robots check/import; that’s the goal
for this week. Once there’s a solid base of URLs, the hope is to
then start extracting newly discovered ones from the resulting ARC
files on the output, so the crawl keeps building on itself.
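The robots check before loading is the kind of thing Python's standard library already handles. A minimal sketch, assuming a robots.txt body already fetched and an invented agent string ("grub-client" is a placeholder, not Grub's real identifier):

```python
from urllib import robotparser


def allowed_by(robots_txt: str, url: str, agent: str = "grub-client") -> bool:
    """Check one candidate URL against a robots.txt body.

    robots_txt is the text of the site's /robots.txt; agent is the
    crawler's user-agent token (a placeholder name here).
    """
    rp = robotparser.RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp.can_fetch(agent, url)
```

For a 16-million-URL import you would fetch each host's /robots.txt once, keep one parsed RobotFileParser per host in a cache, and run every candidate URL for that host through it, rather than hitting robots.txt per URL.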
Moving up to the big picture, the overall goal here is to focus Grub
on being completely open on both the input and the output: a shared
crawling resource for use by anyone. More specifically, to turn the
administration into an open wiki where anyone can suggest new URLs,
review existing URLs, create site policies, and view crawl stats and
samples for any set. On the output side, anyone will be able to grab the
latest cached copies of individual URLs, get entire snapshots/sets as
they happen, or even build custom jobs to filter through and grab
copies of just what they need. I’ll take some time to get all this
together, of course.