Any Python coders out there?
Jan. 29th, 2009 10:30 am

I've got a python script, from here, that turns a MediaWiki installation into a static HTML website - suitable for giving you an offline snapshot of your wiki!
However, the script, as written, uses a metric crapload of memory if your wiki is big. As in, runs OUT of memory if your wiki is too big.
The traceback is:
Traceback (most recent call last):
  File "mw2html.py", line 742, in <module>
    main()
  File "mw2html.py", line 738, in main
    run(config)
  File "mw2html.py", line 591, in run
    doc = f.read()
  File "/usr/lib/python2.5/socket.py", line 295, in read
    return "".join(buffers)
MemoryError

Anyone know enough Python to tell me if there's a way to do it effectively in smaller chunks?
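For the record, here's roughly the shape of what I mean by "smaller chunks" - a sketch only, not the actual mw2html.py code, and the url, outpath and CHUNK names are made up:

import urllib

url = 'http://www.example.com/wiki/index.php/Some_Page'   # placeholder page
outpath = 'Some_Page.html'                                # placeholder output file
CHUNK = 64 * 1024                                         # 64 KB at a time; size is arbitrary

f = urllib.urlopen(url)
out = open(outpath, 'wb')
while True:
    piece = f.read(CHUNK)   # read a bounded amount instead of the whole page at once
    if not piece:
        break
    out.write(piece)
out.close()
f.close()

The catch is that mw2html.py presumably wants the whole page in one string so it can rewrite the links, so streaming straight to disk would mean doing that rewrite as a separate file-by-file pass afterwards.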
(no subject)
Date: 2009-01-29 06:10 pm (UTC)

$ wget -rkp -l 99 http://www.example.com/wiki/root/page
may be close to what you want (recurse, convert links for local viewing, include page prereqs, recursion limit really high).
"".join(buffers) is kind of naïve but not really any worse than any other approach. Reading the source, it looks like the author may not be familiar with the python garbage collector; I *think* what ends up happening is that every page in the wiki ends up staying in memory, rather than being read, written to disk, and forgotten. You might try adding
del doc
at line 631 (after f.close()) to force the GC.
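Roughly, in context (this is a paraphrase of the relevant part of run(), not the real source; url and outpath are made-up names):

import urllib

url = 'http://www.example.com/wiki/index.php/Some_Page'   # placeholder
outpath = 'Some_Page.html'                                # placeholder

f = urllib.urlopen(url)
doc = f.read()            # the line 591 read from the traceback
# ... mw2html.py rewrites links etc. in doc here ...
out = open(outpath, 'w')
out.write(doc)
out.close()
f.close()
del doc                   # the suggested addition: drop the reference so the
                          # string can be freed before the next page is fetched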
(no subject)
Date: 2009-01-29 06:17 pm (UTC)

Trying the del doc thing now....
(no subject)
Date: 2009-01-29 06:29 pm (UTC)

Interesting side effect: The script, mid-dying, drops a 2.0GB file called "zi[PILE OF CHARACTERS HERE]" in the running directory.
(no subject)
Date: 2009-01-29 11:39 pm (UTC)

If I can get access to the wiki (and thus repro this myself) it will be useful.
(no subject)
Date: 2009-01-30 01:05 am (UTC)

As I remember it, Python's GC worked like this:

- when an object's reference count decreased, that object was eligible for immediate GC (which provides deterministic finalization for most objects, since they're only visible within some stack frame)
- every N GCs, a sweep GC is done to collect cycles.
I can't imagine Guido &c breaking programs that relied on the deterministic finalization semantics, since it's been part of Python since forever and a lot of early programs rely on it.
What I DON'T remember is whether assigning a new value to a variable immediately attempts to GC the old object or not. I'd expect it to be immediate, but I'd have to check.
My plan is to repro the problem inside pdb and see what objects are huge. :)
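A quick way to check the rebinding question, for what it's worth (CPython-specific behaviour, Python 2 syntax to match the 2.5 traceback above; the class is just for the demo):

class Noisy(object):
    def __del__(self):
        print 'freed'       # runs when the last reference to the instance goes away

x = Noisy()
x = 'something else'        # rebinding decrefs the old object; 'freed' prints immediately

In CPython the rebinding drops the old object's reference count on the spot, and a count of zero frees it right away.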
(no subject)
Date: 2009-01-29 06:14 pm (UTC)

:)
(no subject)
Date: 2009-01-29 06:21 pm (UTC)

Go ahead, I'll wait.
(no subject)
Date: 2009-01-29 06:53 pm (UTC)

Because that's what I get. And while the wiki I'm creating an offline copy of isn't Wikipedia, it's still about 500MB when *all* you have is the actual content, not counting the many thousands of copies of the content you'll make when you start pulling the revisions in.
(no subject)
Date: 2009-01-29 07:05 pm (UTC)

Another alternative to avoid that is to generate a list of URLs and use the --input-file= option to restrict what wget goes after, rather than using -m to mirror it.
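A rough sketch of that approach, with everything here (the root URL, the page list, the file name) as placeholders:

import subprocess

wiki_root = 'http://www.example.com/wiki/index.php'
page_titles = ['Main_Page', 'Another_Page']      # however you enumerate your wiki's pages

listing = open('url-list.txt', 'w')
for title in page_titles:
    listing.write('%s/%s\n' % (wiki_root, title))
listing.close()

# then let wget fetch only those URLs, converting links and grabbing page requisites
subprocess.call(['wget', '-kp', '--input-file=url-list.txt'])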
(no subject)
Date: 2009-01-29 08:40 pm (UTC)

Whereas the script I already have does all of this for me, and just runs out of memory because it's doing *something* screwy with not handling the pages individually. So, figuring that out is hopefully going to be less work.