Thursday, October 06, 2005

mw2html -- Export Mediawiki to static, traditional website

mw2html is a Python program that exports a MediaWiki site to a traditional-looking HTML site. No indices are generated, and the Wikipedia-style sidebars and extra formatting are removed to give a simple, streamlined site (you can substitute your own sidebars if you wish). The output HTML is Monobook-specific and rather verbose at the moment, but the exported sites don't look bad. (Source code).
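
A typical invocation (the output directory name is arbitrary):

% python mw2html.py http://yourwikiurl.com/ outdir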

47 Comments:

Anonymous Donato said...

Please add more documentation. I want to use your script; I'm not good at Python and PHP, but I'm an experienced programmer (Java/ASP), so I know the concepts. Please help.

Many Thanks!

7:10 AM  
Blogger Connelly said...

First, install Python from [1]. Next, verify that Python is in your system shell's PATH by typing python *enter* from the shell. If successful, this should place you in a Python interactive interpreter. If this fails, modify the PATH environment variable to include the Python binary directory (on Windows this is C:\Python2.x\).
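
For example, on Windows you might add it from the command prompt like this (the directory name depends on your Python version, so treat this as a sketch):

set PATH=%PATH%;C:\Python24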

Next, save mw2html.py somewhere. Install your copy of Mediawiki [2]. Now at the command line, type (on one line, without the % character)

% python mw2html.py http://yourwikiurl.com/ outdir

Browse the output directory to see the generated HTML. You can use python mw2html.py with no arguments to get a list of available command line options.

This advice is pretty general; if you have a more specific problem let me know.

9:55 PM  
Anonymous Andy Davidson said...

I just tried mw2html.py to back up a fairly new mediawiki. I ran 'python mw2html.py http://127.0.0.1/wiki /tmp/wiki'

File "mw2html.py", line 218

SyntaxError: EOL while scanning single-quoted string

I can't include the call (doc.replace) because blogger thinks I am trying to use server side scripting.

Do you have a spare clue as to what I might be doing wrong? If you need more info, let me know and I can send email.

4:30 PM  
Anonymous Andy Davidson said...

Never mind. I found the problem: I had Privoxy running and it replaced some open commands. Ugh.

I also had to remove an assertion at line 265, but it now runs to completion.

3:18 PM  
Blogger Forrest O. said...

ImportError: No module named textwrap

Any hints on how I might be able to get around that?

1:52 PM  
Anonymous Anonymous said...

This worked perfectly for me, where many other methods had failed. Thank you very much.

8:16 AM  
Anonymous Michael Kennedy said...

Is there a way for this to access password-protected wikis? I'd like to grab an export of the one I use for work clients, which I have password-protected, so I can store it on my USB key for those days when I don't have easy web access. Thanks.

1:05 AM  
Anonymous Anonymous said...

It works very well.

I am modifying your code to remove the footer entirely and make it suitable for generating software documentation and the like from a wiki.

However, I have a little problem: the script freezes for minutes right after the start, and on some files during generation of the output directory. Task Manager shows no activity. Please let me know if there is any workaround.

8:47 AM  
Anonymous Anonymous said...

Hi,

I'm a newbie. I installed Python. It's in my path, but typing python and pressing Enter appears to be invalid... and so are the other commands... Can you help?

2:02 AM  
Anonymous Anonymous said...

Hi

I tried the HTML exporter on a fresh MediaWiki 2.6.8 installation, but there is a weird problem: when I start the script, the whole network connection goes down. In another shell I started a persistent ping to some host, and as soon as I start the exporter I get ping timeouts and can't surf the net anymore. If I cancel the exporter, it works again. My network monitor shows 0% net activity, though.

After a _long_ time the HTML was in the outdir, but badly formatted because it couldn't fetch some files, like:

Error opening: http://censoredurl/skins/monobook/main.css?7

Error opening: http://censoredurl/skins/common/IEFixes.js

Those are accessible via browser without any problems.

My System:
WinXP SP2, Python 2.4, htmldata 1.1.0, mediawiki 2.6.8, no proxy, wiki is not locally installed

Any ideas?

6:32 AM  
Anonymous Niels de Vos said...

Hi Connelly, thanks for the really cool script! Unfortunately there seems to be a little error(?) in the function parse_css. In one of my .css files there is a reference to a w3c.com page. Here's a patch to fix this, handling it just like the parse_html function does.

Cheers!

--- mw2html.py.orig 2006-09-03 15:04:21.000000000 +0200
+++ mw2html.py 2006-09-03 14:55:53.000000000 +0200
@@ -480,10 +480,13 @@

     L = htmldata.urlextract(doc, url, 'text/css')
     for item in L:
-        # Store url locally.
         u = item.url
-        new_urls += [u]
-        item.url = url_to_relative(u, url, config)
+        if should_follow(url, u):
+            # Store url locally.
+            new_urls += [u]
+            item.url = url_to_relative(u, url, config)
+        else:
+            item.url = rewrite_external_url(item.url, config)

     newdoc = htmldata.urljoin(doc, L)
     newdoc = post_css_transform(newdoc, url, config)

6:23 AM  
Anonymous Anonymous said...

When I run the command:
python mw2html.py http://localhost/mediawiki/ ~/wiki1

and open main_page.html, the wiki sidebar is lost, so if I browse to a child article I cannot get back to the main page. Could you guide me on how to export the wiki sidebar into sidebar.html?
Thank you

2:03 AM  
Anonymous Anonymous said...

Hello,

Thanks for your post about using mw2html to export MediaWiki to HTML :) it is such a great help to me.

However, I have a problem: the left menu of the wiki is not exported on each HTML page, so it is hard to navigate from one page to another.

(Left side menu includes:
* Main Page
* Community portal
* Current events
* Recent changes
* Random page
* Help
* Donations
...
)

I tried to use the command with the -l option to export the left side menu bar, but failed. Could you help me export the left side menu bar of the wiki on each HTML page?

Thank you very much and looking forward to hearing from you.

5:30 PM  
Blogger Connelly said...

With the left sidebar option you pass an HTML filename as an argument, and that file's contents are pasted into the "left sidebar" region. Thus you can make a custom left sidebar HTML file with whichever links you desire.
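
For example (a sketch; run python mw2html.py with no arguments to check the exact option spelling), you could save a sidebar.html containing ordinary HTML links to the generated pages, such as:

<a href="main_page.html">Main Page</a>

and pass it on the command line, roughly:

% python mw2html.py http://yourwikiurl.com/ outdir -l sidebar.html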

11:15 PM  
Anonymous Anonymous said...

Is there any way to restrict the link-following depth, along the lines of wget? Or better yet, to prevent any link following outside of a given URL?

For example, we have lots of stuff on oursite.com/wiki that we want to save, and stuff on oursite.com/ that we don't.

Thanks for providing a very useful piece of software!

10:33 AM  
Anonymous Anonymous said...

Superb work; it solved a big headache. Easy to install, and it worked without a hitch.

3:45 AM  
Blogger Marija said...

fantastic - everything worked fine from the start

Many thanks for the great work !

3:04 AM  
Anonymous Vlad said...

I was sure that I would need to write my own MediaWiki converter, before I found this one.

Thanks a lot!

3:53 AM  
Blogger Dave said...

I have a mediawiki installed on my local intranet but I can't seem to get the script working. It always stalls out and gives an error. Is there a problem being behind a proxy? I tried removing all of the external links in case that was the problem, but I can't seem to get the script to finish. Any tips? The server is running IIS 6.0 w/ MySQL.

9:29 AM  
Anonymous Kryten said...

Hey man,

This hit the spot for my needs.

I was looking for a way to build a static copy of an internal wiki - and this tool got the job done. My MacBook running Leopard already had Python - so it was almost zero installation to use.

Good stuff, thanks for posting it...

2:46 PM  
Blogger Dan said...

Hello Connelly,
First, thank you for creating this tool. I hope you are still available to answer questions about it. I am running mw2html on the Linux box that hosts my wiki. It runs great; however, after it has processed what I think is a fair chunk of the pages, I get an error. I'm not a Python user, so I'm at a loss as to how to proceed. The traceback I get is as follows:

Traceback (most recent call last):
  File "mw2html.py", line 742, in ?
    main()
  File "mw2html.py", line 738, in main
    run(config)
  File "mw2html.py", line 600, in run
    (doc, new_urls) = parse_css(doc, url, config)
  File "mw2html.py", line 489, in parse_css
    newdoc = post_css_transform(newdoc, url, config)
  File "mw2html.py", line 286, in post_css_transform
    doc = monobook_hack_skin_css(doc, url, config)
  File "mw2html.py", line 265, in monobook_hack_skin_css
    assert c1 in doc
AssertionError

Is there any way of knowing which page it may have been processing when it hit this error?

Thanks in advance,
Dan

8:21 AM  
Anonymous Anonymous said...

Since this is still a top Google hit for making a static copy of a MediaWiki site, here's my 2¢: run the command with no parameters to see the options, and try them! I had more success using the option that disables the skin hack.

11:11 AM  
Anonymous Ron said...

Thank you for the wonderful script. My personal wiki is protected by HTTP authentication. This fix allows you to download your wiki even when it is protected in this way:

(add this code just before the 'while' statement in the run function)
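
# Prompt for HTTP Basic Auth credentials and register them globally with urllib2,
# so the script's subsequent requests send the credentials.
# (Assumes urllib2 is already imported in mw2html.py; if not, add "import urllib2" at the top.)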

password_mgr = urllib2.HTTPPasswordMgrWithDefaultRealm()
username = raw_input("Username: ")
password = raw_input("Password: ")
password_mgr.add_password(None, config.rooturl, username, password)
handler = urllib2.HTTPBasicAuthHandler(password_mgr)
opener = urllib2.build_opener(handler)
urllib2.install_opener(opener)

4:42 AM  
Blogger Domos said...

Thanks for this excellent program, it is what I was looking for.

1:47 PM  
Anonymous carlo said...

Italy, 26 June 2009 (just to show the date!)

I found your wonderful script after much searching on the Internet. It solves the problem of converting MediaWiki to HTML very elegantly. My compliments!
By the way, I found a (dirty) trick for converting only one section of the wiki: assign just the interesting pages to a particular category, then "export by category" (there are nice extensions to do this) to an empty wiki server by means of an XML export file. From that wiki server, now containing only the category of interest, I extract everything with your script.

Carlo

10:40 AM  
Anonymous Fary said...

This worked perfectly for me, where many other methods had failed. Thank you very much.

10:10 AM  
Anonymous Anonymous said...

Thank you Connelly for creating this tool.
Duka, Sep.12, 2009

7:02 PM  
Anonymous Anonymous said...

If I try the script with http://wikitravel.org it seems to stall and no output directory is created.

5:57 PM  
Blogger Jon Drews said...

Thanks for making this useful tool. Just used it without any problems. Thanks!

12:33 PM  
Blogger esor_ekim said...

Hello,
Nice script, easy to use.
The wiki I've run the script against is part of a website and there are links from the wiki to other parts of the website. mw2html followed the links and proceeded to create copies of those as well. I do not see an option to stop this behaviour - am I missing something?
thanks,
Mike.

5:08 AM  
Blogger esor_ekim said...

OK, found a fix:

Changed:

if 'MediaWiki:' in url or 'Special:' in url:

to:

if 'MediaWiki:' in url or 'Special:' in url or rooturl not in url:
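
In context, the idea is to extend the existing page filter so that anything outside the wiki's own URL tree is skipped. A minimal sketch of the same check, with an illustrative helper name (not mw2html's exact code):

def inside_wiki(rooturl, url):
    # Skip MediaWiki internal pages and any URL outside the wiki's root,
    # so the crawl does not wander into the rest of the website.
    if 'MediaWiki:' in url or 'Special:' in url or rooturl not in url:
        return False
    return True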

7:56 AM  
Anonymous Viagra Online said...

I was working on a project the other night and I wanted to export a Mediawiki site to a traditional looking HTML site, but I did not know how to do it till now, thank you!

3:40 PM  
Anonymous Anonymous said...

Great job! Just the tool I needed: simple and efficient. Gratz!

4:36 AM  
Anonymous تقنية said...

This is what I call a tutorial. Thanks, it was very helpful.

8:43 AM  
Blogger Marco said...

After downloading your script, the module htmldata is required, but the given URL for obtaining the module no longer exists. Where can I find this module?

Thanks for help.

5:15 AM  
Blogger Marco said...

After some googling I found http://www.connellybarnes.com/code/htmldata/htmldata which may be the correct module. :)

5:22 AM  
Blogger Kelvinator said...

I saved the file and, using Python 2.6, I get the following:
>>> python mw2html.py http://xmswiki.com/xms/Main_Page

SyntaxError: invalid syntax

and the file name of "mw2html" is highlighted. So what else do I need to do?

9:55 AM  
Blogger Darnell said...

It has been a while since you created this, but your script saved me a lot of work. Thanks Connelly!

11:52 AM  
Blogger Morten Slott Hansen said...

Was just about to do a manual export of a wiki site when Google pointed me to this site. This is awesome stuff, and I do not have to tweak wget or my other mirroring tools ;)

3:23 AM  
Blogger Nishant Shetty said...

Install Python 2.7 and not the 3.3 version; this script works well on older Python versions, and then the htmldata error doesn't come up either. I downloaded the htmldata script from http://www.connellybarnes.com/code/htmldata/htmldata-1.1.1 and saved it as htmldata.py in Python's Tools\Scripts folder.

2:20 AM  
Blogger Nishant Shetty said...

Use Python 2.7 or an older version; it worked for me on that, while the latest 3.3 didn't work. Also, I am not sure whether the htmldata.py module is required for older versions, but I downloaded it from http://www.connellybarnes.com/code/htmldata/htmldata-1.1.1 and saved it as htmldata.py in the C:\Python27\Tools\Scripts and C:\Python27\Lib folders (I'm using Windows for development). Hope this helps!

2:23 AM  

Post a Comment

<< Home