Archive for the ‘Grub’ Category

Short interview with Jeremie Miller

September 26, 2007

Right, well, it’s a long piece this time; I’ve been trying to get it out for a while, but here it is at last. First, I recently put a couple of questions to Jeremie Miller, and the Qs and As are posted below. Further down is a mailing list email from Jer that includes many more details about Grub and its progress.

Interview

Q: What do you see happening in Grub’s near future?
A: The ability to get immediate feedback, upload URLs, download crawl snapshots, etc.; more usable functions; and a better protocol to make developing clients easier.

Q: Grub is currently being used to index the net but is anything else in the pipeline for the rest of the project (i.e. Atlas)?
A: Yeah, Grub is to be the best crawler, and for a social good: anyone can benefit from its results. As for quality indexing, I hope there are multiple projects we can start around that, some of it being natural language, some of it being good ranking, some of it being scaling across many computers, etc., but they all depend on a good source of data. That’s what Grub has to do really well first.

Q: What are you currently working on?
A: Figuring out how to get grub.org running in a more production mode, and exploring its source code. Also, some prototype source code for Atlas stuff and some testing tools, but that’s a few weeks away.

Grub Development post from Jer

A lot more information was released in a post to the Grub development mailing list, and it is included below.

I’ve been meaning to send out the low-down on all the Grubbing going on over the past month or so, and some ideas for where it’s all going. Feel free to ask if I don’t answer anything anyone might want to know here :)

First, most everyone should have noticed that the global stats are finally working, and with them up you can see the service still goes down semi-frequently. We’ve got the entire thing “throttled down” as far as it will go, and it’s still crawling millions of URLs daily and filling up the 30GB partition it’s caged in for testing :)

So, some things learned about the current Grub system:
* it’s not recursive (doesn’t automatically discover/inject new urls)
* it is capable of obeying robots when injected with them
* only grabs text/html right now
* uses a simple checksum to look for changes
* doesn’t track ETag or Last-Modified (pretty major flaws IMO; see the sketch after this list)
* was over-engineered for modularity
* uses a rather obtuse SOAP encoding
* stores crawl results in its own, also obtuse, encoding
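For readers unfamiliar with HTTP revalidation, here is a minimal sketch of what tracking those two headers would buy the crawler. This is illustrative only, not Grub code: the helper and its names are made up, and it uses just the Python standard library. The point is that a checksum can only detect a change after re-downloading the whole body, while ETag/Last-Modified let the server answer 304 Not Modified and skip the transfer entirely.

```python
import hashlib
import urllib.error
import urllib.request

def fetch_if_changed(url, etag=None, last_modified=None):
    """Fetch url, letting the server skip the body if it is unchanged."""
    req = urllib.request.Request(url)
    if etag:
        req.add_header("If-None-Match", etag)
    if last_modified:
        req.add_header("If-Modified-Since", last_modified)
    try:
        resp = urllib.request.urlopen(req)
    except urllib.error.HTTPError as err:
        if err.code == 304:
            return None  # 304 Not Modified: no body was transferred
        raise
    body = resp.read()
    # Validators to store and replay on the next crawl of this URL.
    validators = (resp.headers.get("ETag"), resp.headers.get("Last-Modified"))
    # A checksum only notices a change *after* paying for the full
    # download; the conditional headers above avoid the download.
    return body, validators, hashlib.md5(body).hexdigest()
```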

Hmm, that’s enough pain to start with… so the very first goal was to get the crawl output into a more usable format, the Internet Archive ARC format (http://www.archive.org/web/researcher/ArcFileFormat.php). That happened this week, and now the work-unit binary blobs are being converted into much more useful ARC files automatically, yay!
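As a rough illustration of why ARC files are easier to consume than the old work-unit blobs: each document in an ARC v1 file is just a one-line, space-separated header (URL, IP address, 14-digit UTC timestamp, MIME type, byte length) followed by the captured bytes. The writer below is a hypothetical sketch based on the spec linked above, not Grub’s actual converter, and it omits the filedesc version record that opens a real ARC file.

```python
from datetime import datetime, timezone

def write_arc_record(out, url, ip, content_type, body):
    """Append one ARC v1 URL record to an open binary file."""
    stamp = datetime.now(timezone.utc).strftime("%Y%m%d%H%M%S")
    header = f"{url} {ip} {stamp} {content_type} {len(body)}\n"
    out.write(header.encode("ascii"))  # one-line record header
    out.write(body)                    # the bytes exactly as captured
    out.write(b"\n")                   # newline terminates the record
```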

The next step is to get a lot more URLs loaded. There are about a million total that exist right now, basically a random sample, and we’re churning through those a few times a day. I have extracted over 16 million more URLs from a Wikipedia snapshot; before they get loaded they have to go through a robots check/import, and that’s the goal for this week. Once there’s a solid base of URLs, the hope is to then start extracting new/discovered ones from the resulting ARC files on the output, so the crawl keeps building on itself.
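Both halves of that pipeline are easy to prototype with the Python standard library. The sketch below is illustrative only (the function names and the user-agent string are made up, and Grub’s real importer is not published here): a robots.txt gate for admitting URLs, and a link extractor for discovering new ones in crawled HTML.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit
from urllib.robotparser import RobotFileParser

def robots_allows(url, user_agent="grub-client"):
    """Gate a URL on the site's robots.txt before importing it."""
    parts = urlsplit(url)
    rp = RobotFileParser()
    rp.set_url(f"{parts.scheme}://{parts.netloc}/robots.txt")
    rp.read()  # fetch and parse the site's robots.txt
    return rp.can_fetch(user_agent, url)

class LinkExtractor(HTMLParser):
    """Collect absolute href targets from anchors in a crawled page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(urljoin(self.base_url, value))
```

Feeding each page’s extracted links back through the robots gate and into the queue is exactly the self-building loop described above.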

Moving up to the big picture, the overall goal here is to focus Grub on being completely open on both the input and the output: a shared crawling resource for use by anyone. More specifically, to turn the administration into an open wiki where anyone can suggest new URLs, review existing URLs, create site policies, and view crawl stats and samples for any set. On the output side, anyone will be able to grab the latest cached copies of individual URLs, get entire snapshots/sets as they happen, or even build custom jobs to filter through and grab copies of just what they need. I’ll take some time to get all this together of course :)

Jer


First ticket

August 30, 2007

Trac got its first use. Ticket number one is now dedicated to a compression error.

Trac is for reporting your errors with Grub. If you want to start a new ticket for an error, follow this link.

More soon.

Mark

Grub Developments

August 28, 2007

Well, it was announced on the mailing list today that there is now a bug tracker available for Grub at http://dev.grub.org (it runs Trac), and also an svn repository at http://dev.grub.org/svn.

It was also announced that there’s a new mailing list for Grub development, which can be found here.

Jer also announced to everyone that he’s off for the week, so new developments may be slow. However, if you want to blog on here you can contact me for access. Ways to do so are here.

Mark

Grub Stats

August 22, 2007

[Image: Grub’s Windows screensaver mode]
Well, Grub now has over 2,500 users, with over 300 active this month, everyone helping to index the web and fill up the server’s drives. Releases of the data are expected in the near future for analysis by the many users and developers interested in the project.

To download Grub and help with the project, follow this link and download the correct version for your operating system (either Linux or Windows).

More soon.

Mark

Grub update

August 18, 2007

Many people are now using the Grub crawler, and results are being fed back to the main server database to be stored. Stats on the amount of crawling can be seen here. However, some users are saying in forum posts (here) that the server is no longer distributing work units. Are you experiencing the same? Leave a message in the forum.

On the same note, I am hoping that bug tracking will be up soon.

More soon.

Mark