theweaselking: (Work now)
[personal profile] theweaselking
I've got a Python script, from here, that turns a MediaWiki installation into a static HTML website, suitable for giving you an offline snapshot of your wiki!

However, the script, as written, uses a metric crapload of memory if your wiki is big. As in, runs OUT of memory if your wiki is too big.

The traceback is:
Traceback (most recent call last):
  File "mw2html.py", line 742, in <module>
    main()
  File "mw2html.py", line 738, in main
    run(config)
  File "mw2html.py", line 591, in run
    doc = f.read()
  File "/usr/lib/python2.5/socket.py", line 295, in read
    return "".join(buffers)
MemoryError
Anyone know enough Python to tell me if there's a way to do it effectively in smaller chunks?
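The kind of thing I'm hoping for, roughly (a sketch only, not code from mw2html.py; the names are made up, and the real script also rewrites links before saving, so this wouldn't be a drop-in fix):

import urllib2
import shutil

def fetch_to_file(url, out_filename, chunk_size=64 * 1024):
    # Stream the response to disk chunk_size bytes at a time, so only one
    # chunk is ever held in memory instead of the whole page.
    f = urllib2.urlopen(url)
    try:
        out = open(out_filename, 'wb')
        try:
            shutil.copyfileobj(f, out, chunk_size)
        finally:
            out.close()
    finally:
        f.close()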

(no subject)

Date: 2009-01-29 03:58 pm (UTC)
From: [identity profile] elffin.livejournal.com
Alternately, how feasible would it be to, say - open every page of the wiki in a separate browser tab (or browser history)? Firefox and/or Opera do better memory management than python hacks and with a macro script automating the process ...

(no subject)

Date: 2009-01-29 06:10 pm (UTC)
From: [identity profile] ungratefulninja.livejournal.com
wget (http://www.gnu.org/software/wget/) has a recursive archival mode that fixes up links; you might try that. Off the top of my head,

$ wget -rkp -l 99 http://www.example.com/wiki/root/page

may be close to what you want (recurse, convert links for local viewing, include page prereqs, recursion limit really high).

"".join(buffers) is kind of naïve but not really any worse than any other approach. Reading the source, it looks like the author may not be familiar with the python garbage collector; I *think* what ends up happening is that every page in the wiki ends up staying in memory, rather than being read, written to disk, and forgotten. You might try adding

del doc

at line 631 (after f.close()) to drop the last reference, so the GC can reclaim the page.
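In context, I'd expect it to end up roughly like this (a guess at the shape of run() around lines 591-631, based only on the traceback; I don't have the source in front of me, and "filename" is a made-up name):

doc = f.read()        # line 591: the whole page read into one string
# ... rewrite links, etc. ...
out = open(filename, 'wb')
out.write(doc)
out.close()
f.close()             # line 631
del doc               # drop the last reference so the string can be freed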

(no subject)

Date: 2009-01-29 06:17 pm (UTC)
From: [identity profile] theweaselking.livejournal.com
In fact, I *know* it's not doing any garbage collection. I just don't know how to fix that.

Trying the del doc thing now....

(no subject)

Date: 2009-01-29 06:20 pm (UTC)
From: [identity profile] theweaselking.livejournal.com
Looks good so far - memory isn't blatantly leaking. I'll know if it worked right in about 15 minutes.

(no subject)

Date: 2009-01-29 06:29 pm (UTC)
From: [identity profile] theweaselking.livejournal.com
Whoops, no good. Same error.

Interesting side effect: The script, mid-dying, drops a 2.0GB file called "zi[PILE OF CHARACTERS HERE" in the running directory.

(no subject)

Date: 2009-01-29 11:39 pm (UTC)
From: [identity profile] ungratefulninja.livejournal.com
Buggery and heck. Look me up on your IM of choice - angrybaldguy@gmail.com on MSN and google, derspiny on AIM, and 23778918 on ICQ.

If I can get access to the wiki (and thus repro this myself) it will be useful.

(no subject)

Date: 2009-01-30 12:45 am (UTC)
From: [identity profile] the-trav.livejournal.com
From a brief glance, the globals url_filename_cache and wrote_file_set will continue to grow throughout the execution. However, they should only hold relatively short strings, so that shouldn't be a problem in most cases, though you may want to try replacing them with something that performs the calculation every time.
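By way of illustration (the cache shape is a guess, and compute_filename stands in for whatever mw2html.py really does):

import urllib

url_filename_cache = {}

def compute_filename(url):
    # stand-in only: flatten the URL into a safe local file name
    return urllib.quote(url, safe='') + '.html'

def cached_filename(url):
    # the dict grows by one entry per URL seen and is never emptied
    if url not in url_filename_cache:
        url_filename_cache[url] = compute_filename(url)
    return url_filename_cache[url]

def uncached_filename(url):
    # recompute every time: slower, but constant memory
    return compute_filename(url)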

(no subject)

Date: 2009-01-30 12:48 am (UTC)
From: [identity profile] the-trav.livejournal.com
Also, line 316 appears to open a URL handle that never gets closed. Does Python automagically close those on method exit?
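i.e. something like this, if it turns out to matter (a sketch; I haven't checked what line 316 actually opens, and the names are made up):

import urllib

def fetch(url):
    f = urllib.urlopen(url)
    try:
        # read while the handle is open, then close it even if read() throws
        return f.read()
    finally:
        f.close()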

(no subject)

Date: 2009-01-30 01:05 am (UTC)
From: [identity profile] ungratefulninja.livejournal.com
If I remember right, it used to be (2.4ish era) that GC happened at a handful of points:
- when an object's reference count dropped to zero, it was collected immediately (which provides deterministic finalization for most objects, since they're only visible within some stack frame)
- periodically (driven by allocation counts), a separate sweep GC is done to collect reference cycles.

I can't imagine Guido &c breaking programs that relied on the deterministic finalization semantics, since it's been part of Python since forever and a lot of early programs rely on it.

What I DON'T remember is whether assigning a new value to a variable immediately attempts to GC the old object or not. I'd expect it to be immediate, but I'd have to check.
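A quick way to check the rebinding question in CPython (the __del__ is just there to make the collection visible):

class Noisy(object):
    def __init__(self, label):
        self.label = label
    def __del__(self):
        # in CPython this runs as soon as the last reference goes away
        print 'collected:', self.label

doc = Noisy('first page')
doc = Noisy('second page')   # prints 'collected: first page' immediately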

My plan is to reproduce the problem inside pdb and see what objects are huge. :)

(no subject)

Date: 2009-01-30 01:11 am (UTC)
From: [identity profile] ungratefulninja.livejournal.com
...now that I've actually *looked*: yes, f should be automatically closed and discarded as soon as the method returns, since f is a local symbol and is the only reference to the object it points to.

(no subject)

Date: 2009-01-30 01:56 pm (UTC)
From: [identity profile] theweaselking.livejournal.com
Access to the wiki is difficult to swing - I could give it to you easily enough, but it's full of proprietary client data, so I could also get sued for giving you access.

(no subject)

Date: 2009-01-29 06:50 pm (UTC)
From: [identity profile] theweaselking.livejournal.com
PS: Wget is no good, because MediaWiki generates trillions of teeny little links to edit pages and internal sections and histories and the like. I don't want any of those. I want a single HTML copy of each content page, with no history, no edits, etc.

(no subject)

Date: 2009-01-29 07:06 pm (UTC)
From: [identity profile] zastrazzi.livejournal.com
If you generate a sitemap you can gank the URL list, then grep -v the history/edit-patterned URLs out of it.
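Roughly this kind of filter, say (sitemap-urls.txt is just whatever you dump the sitemap's URL list into, and the patterns are guesses at what your URLs contain):

$ grep -v -e 'action=' -e 'oldid=' -e 'diff=' -e 'Talk:' sitemap-urls.txt > content-urls.txt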

(no subject)

Date: 2009-01-29 06:21 pm (UTC)
From: [identity profile] theweaselking.livejournal.com
Have you ever *tried* that on a Wiki?

Go ahead, I'll wait.

(no subject)

Date: 2009-01-29 06:45 pm (UTC)
From: [identity profile] zastrazzi.livejournal.com
Just under 10 minutes for the one we maintain at work. Helps that I can run it locally of course.

(no subject)

Date: 2009-01-29 06:53 pm (UTC)
From: [identity profile] theweaselking.livejournal.com
And did you, like me, get copies of the edit pages and the diff pages and the full histories of all the pages?

Because that's what I get. And while the wiki I'm creating an offline copy of isn't Wikipedia, it's still about 500MB when *all* you have is the actual content, no counting the many thousands of copies of the content you'll make when you start pulling the revisions in.

(no subject)

Date: 2009-01-29 07:05 pm (UTC)
From: [identity profile] zastrazzi.livejournal.com
Yep, I did.

Another alternative to avoid that is to generate a list of URLs and use the --input-file= option to restrict what wget goes after, rather than using -m to mirror it.
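In other words, something like this (content-urls.txt being whatever you called the filtered list):

$ wget -kp --input-file=content-urls.txt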

(no subject)

Date: 2009-01-29 08:33 pm (UTC)
From: [identity profile] mhoye.livejournal.com
Use the --reject option to avoid downloading parts you don't want: reject Talk:, action=edit, etc.
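Untested, and if I remember right wget only matches -R patterns against the file-name part of each URL, so it may need tuning for your URL scheme, but roughly:

$ wget -rkp -l 99 -R '*action=edit*,*oldid=*,*diff=*,*Talk:*' http://www.example.com/wiki/root/page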

(no subject)

Date: 2009-01-29 08:40 pm (UTC)
From: [identity profile] theweaselking.livejournal.com
That requires me to go through and figure out a whole lot of expressions that will get all the pages I don't want without matching any of the pages I do want.

Whereas the script I already have does all of this for me, and just runs out of memory because it's doing *something* screwy with not handling the pages individually. So figuring that out is hopefully going to be less work.

(no subject)

Date: 2009-01-29 09:23 pm (UTC)
From: [identity profile] mhoye.livejournal.com
I bet that if you use just Talk: and "&action=" you'll get what you want.

(no subject)

Date: 2009-01-30 01:59 pm (UTC)
From: [identity profile] theweaselking.livejournal.com
And &diff and &oldid and those are just the first two I found on the very first page I looked at.
