theweaselking: (Work now)
[personal profile] theweaselking
I've got a Python script, from here, that turns a MediaWiki installation into a static HTML website, suitable for giving you an offline snapshot of your wiki!

However, the script, as written, uses a metric crapload of memory if your wiki is big. As in, runs OUT of memory if your wiki is too big.

The traceback is:
Traceback (most recent call last):
  File "mw2html.py", line 742, in <module>
    main()
  File "mw2html.py", line 738, in main
    run(config)
  File "mw2html.py", line 591, in run
    doc = f.read()
  File "/usr/lib/python2.5/socket.py", line 295, in read
    return "".join(buffers)
MemoryError
Anyone know enough Python to tell me if there's a way to do it effectively in smaller chunks?
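The kind of thing I'm hoping for, roughly (a sketch only, not code from mw2html.py; the names are made up, and the real script also rewrites links before saving, so this wouldn't be a drop-in fix):

import urllib2
import shutil

def fetch_to_file(url, out_filename, chunk_size=64 * 1024):
    # Stream the response to disk chunk_size bytes at a time, so only one
    # chunk is ever held in memory instead of the whole page.
    f = urllib2.urlopen(url)
    try:
        out = open(out_filename, 'wb')
        try:
            shutil.copyfileobj(f, out, chunk_size)
        finally:
            out.close()
    finally:
        f.close()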

(no subject)

Date: 2009-01-29 03:58 pm (UTC)
From: [identity profile] elffin.livejournal.com
Alternately, how feasible would it be to, say - open every page of the wiki in a separate browser tab (or browser history)? Firefox and/or Opera do better memory management than python hacks and with a macro script automating the process ...

(no subject)

Date: 2009-01-29 06:10 pm (UTC)
From: [identity profile] ungratefulninja.livejournal.com
wget (http://www.gnu.org/software/wget/) has a recursive archival mode that fixes up links; you might try that. Off the top of my head,

$ wget -rkp -l 99 http://www.example.com/wiki/root/page

may be close to what you want (recurse, convert links for local viewing, include page prereqs, recursion limit really high).

"".join(buffers) is kind of naïve but not really any worse than any other approach. Reading the source, it looks like the author may not be familiar with the python garbage collector; I *think* what ends up happening is that every page in the wiki ends up staying in memory, rather than being read, written to disk, and forgotten. You might try adding

del doc

at line 631 (after f.close()) to drop the last reference, so the GC can reclaim the page.
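In context, I'd expect it to end up roughly like this (a guess at the shape of run() around lines 591-631, based only on the traceback; I don't have the source in front of me, and "filename" is a made-up name):

doc = f.read()        # line 591: the whole page read into one string
# ... rewrite links, etc. ...
out = open(filename, 'wb')
out.write(doc)
out.close()
f.close()             # line 631
del doc               # drop the last reference so the string can be freed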

(no subject)

Date: 2009-01-29 06:17 pm (UTC)
From: [identity profile] theweaselking.livejournal.com
In fact, I *know* it's not doing any garbage collection. I just don't know how to fix that.

Trying the del doc thing now....

(no subject)

Date: 2009-01-29 06:20 pm (UTC)
From: [identity profile] theweaselking.livejournal.com
Looks good so far - memory isn't blatantly leaking. I'll know if it worked right in about 15 minutes.

(no subject)

Date: 2009-01-29 06:29 pm (UTC)
From: [identity profile] theweaselking.livejournal.com
Whoops, no good. Same error.

Interesting side effect: The script, mid-dying, drops a 2.0GB file called "zi[PILE OF CHARACTERS HERE" in the running directory.

(no subject)

Date: 2009-01-29 11:39 pm (UTC)
From: [identity profile] ungratefulninja.livejournal.com
Buggery and heck. Look me up on your IM of choice - angrybaldguy@gmail.com on MSN and google, derspiny on AIM, and 23778918 on ICQ.

If I can get access to the wiki (and thus repro this myself) it will be useful.

(no subject)

Date: 2009-01-30 12:45 am (UTC)
From: [identity profile] the-trav.livejournal.com
From a brief glance, the globals url_filename_cache and wrote_file_set will continue to grow throughout the execution. However, they should only hold relatively short strings, so that shouldn't be a problem in most cases, though you may want to try replacing them with something that performs the calculation every time.
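By way of illustration (the cache shape is a guess, and compute_filename stands in for whatever mw2html.py really does):

import urllib

url_filename_cache = {}

def compute_filename(url):
    # stand-in only: flatten the URL into a safe local file name
    return urllib.quote(url, safe='') + '.html'

def cached_filename(url):
    # the dict grows by one entry per URL seen and is never emptied
    if url not in url_filename_cache:
        url_filename_cache[url] = compute_filename(url)
    return url_filename_cache[url]

def uncached_filename(url):
    # recompute every time: slower, but constant memory
    return compute_filename(url)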

(no subject)

Date: 2009-01-30 12:48 am (UTC)
From: [identity profile] the-trav.livejournal.com
Also, line 316 appears to open a URL handle that never gets closed. Does Python automagically close those on method exit?
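i.e. something like this, if it turns out to matter (a sketch; I haven't checked what line 316 actually opens, and the names are made up):

import urllib

def fetch(url):
    f = urllib.urlopen(url)
    try:
        # read while the handle is open, then close it even if read() throws
        return f.read()
    finally:
        f.close()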

(no subject)

Date: 2009-01-30 01:05 am (UTC)
From: [identity profile] ungratefulninja.livejournal.com
If I remember right, it used to be (2.4ish era) that GC happened at a handful of points:
- when an object's reference count dropped to zero, it was collected immediately (which provides deterministic finalization for most objects, since they're only visible within some stack frame)
- periodically (driven by allocation counts), a separate sweep GC is done to collect reference cycles.

I can't imagine Guido &c breaking programs that relied on the deterministic finalization semantics, since it's been part of Python since forever and a lot of early programs rely on it.

What I DON'T remember is whether assigning a new value to a variable immediately attempts to GC the old object or not. I'd expect it to be immediate, but I'd have to check.
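A quick way to check the rebinding question in CPython (the __del__ is just there to make the collection visible):

class Noisy(object):
    def __init__(self, label):
        self.label = label
    def __del__(self):
        # in CPython this runs as soon as the last reference goes away
        print 'collected:', self.label

doc = Noisy('first page')
doc = Noisy('second page')   # prints 'collected: first page' immediately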

My plan is to reproduce the problem inside pdb and see what objects are huge. :)

(no subject)

Date: 2009-01-30 01:11 am (UTC)
From: [identity profile] ungratefulninja.livejournal.com
...now that I've actually *looked*: yes, f should be automatically closed and discarded as soon as the method returns, since f is a local symbol and is the only reference to the object it points to.

(no subject)

Date: 2009-01-30 01:56 pm (UTC)
From: [identity profile] theweaselking.livejournal.com
Access to the wiki is difficult to swing - I could give it to you easily enough, but it's full of proprietary client data, so I could also get sued for giving you access.

(no subject)

Date: 2009-01-29 06:50 pm (UTC)
From: [identity profile] theweaselking.livejournal.com
PS: Wget is no good, because MediaWiki generates trillions of teeny little links to edit pages and internal sections and histories and the like. I don't want any of those. I want a single HTML copy of each content page, with no history, no edits, etc.

(no subject)

Date: 2009-01-29 07:06 pm (UTC)
From: [identity profile] zastrazzi.livejournal.com
If you generate a sitemap you can gank the URL list, then grep -v the history/edit-patterned URLs out of it.
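Roughly this kind of filter, say (sitemap-urls.txt is just whatever you dump the sitemap's URL list into, and the patterns are guesses at what your URLs contain):

$ grep -v -e 'action=' -e 'oldid=' -e 'diff=' -e 'Talk:' sitemap-urls.txt > content-urls.txt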

(no subject)

Date: 2009-01-29 06:21 pm (UTC)
From: [identity profile] theweaselking.livejournal.com
Have you ever *tried* that on a Wiki?

Go ahead, I'll wait.

(no subject)

Date: 2009-01-29 06:45 pm (UTC)
From: [identity profile] zastrazzi.livejournal.com
Just under 10 minutes for the one we maintain at work. Helps that I can run it locally of course.

(no subject)

Date: 2009-01-29 06:53 pm (UTC)
From: [identity profile] theweaselking.livejournal.com
And did you, like me, get copies of the edit pages and the diff pages and the full histories of all the pages?

Because that's what I get. And while the wiki I'm creating an offline copy of isn't Wikipedia, it's still about 500MB when *all* you have is the actual content, no counting the many thousands of copies of the content you'll make when you start pulling the revisions in.

(no subject)

Date: 2009-01-29 07:05 pm (UTC)
From: [identity profile] zastrazzi.livejournal.com
Yep, I did.

Another alternative to avoid that is to generate a list of URLs and use the --input-file= option to restrict what wget goes after, rather than using -m to mirror it.
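In other words, something like this (content-urls.txt being whatever you called the filtered list):

$ wget -kp --input-file=content-urls.txt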

(no subject)

Date: 2009-01-29 08:33 pm (UTC)
From: [identity profile] mhoye.livejournal.com
Use the --reject option to avoid downloading parts you don't want: reject Talk:, action=edit, etc.
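Untested, and if I remember right wget only matches -R patterns against the file-name part of each URL, so it may need tuning for your URL scheme, but roughly:

$ wget -rkp -l 99 -R '*action=edit*,*oldid=*,*diff=*,*Talk:*' http://www.example.com/wiki/root/page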

(no subject)

Date: 2009-01-29 08:40 pm (UTC)
From: [identity profile] theweaselking.livejournal.com
That requires me to go through and figure out a whole lot of expressions that will get all the pages I don't want without matching any of the pages I do want.

Whereas the script I already have does all of this for me, and just runs out of memory because it's doing *something* screwy with not handling the pages individually. So figuring that out is hopefully going to be less work.

(no subject)

Date: 2009-01-29 09:23 pm (UTC)
From: [identity profile] mhoye.livejournal.com
I bet that if you use just Talk: and "&action=" you'll get what you want.

(no subject)

Date: 2009-01-30 01:59 pm (UTC)
From: [identity profile] theweaselking.livejournal.com
And &diff and &oldid and those are just the first two I found on the very first page I looked at.
