annotate src/org/nwoca/ssdt/tools/html2wiki/Html2Wiki.java @ 14:c8442e0eff84
Remove <caption> tags. Generlized {table} around {code} blocks.
author | smith@nwoca.org |
---|---|
date | Tue, 01 Feb 2011 12:34:45 -0500 |
parents | cf58f4b9902b |
children | 494ca5643e1a |
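
The change described above (dropping `<caption>` elements and collapsing a `{table}` wrapper around `{code}` blocks) can be tried in isolation. The following is a minimal sketch, assuming the transformers apply their patterns in a way equivalent to `String.replaceAll`; the `DeleteTransformer` and `ReplaceTransformer` implementations are not shown on this page. The class and method names (`CodeTableSketch`, `stripCaption`, `unwrapCode`) and the sample input are hypothetical; only the three regular expressions are copied from lines 41, 91 and 92 of the listing.

```java
// Hypothetical stand-alone sketch; not part of Html2Wiki.java.
final class CodeTableSketch {

    // Pattern from line 41: delete a <caption> element that sits on one line.
    // Without (?s), '.' does not cross line breaks.
    static String stripCaption(String html) {
        return html.replaceAll("<caption>.*</caption>", "");
    }

    // Patterns from lines 91 and 92: drop the {table}/{tr}/{td} markup that
    // wraps a {code} block, leaving only the {code} markers.
    static String unwrapCode(String wiki) {
        // Opening side: "{table:...}" plus any whitespace, {tr}/{td}/{th}
        // markers, or *bold* words up to the opening {code}.
        String opened = wiki.replaceAll(
                "\\{table:.*\\}(\\n|\\s|\\{t.\\}|\\*\\S*\\*)*\\{code\\}", "{code}");
        // Closing side: the closing {code} plus trailing markers up to {table}.
        return opened.replaceAll(
                "\\{code\\}(\\n|\\{t.\\}|\\s)*\\{table\\}", "{code}");
    }

    public static void main(String[] args) {
        // Assumed sample of already-converted wiki text.
        String wiki = "{table:border=1|width=75%}\n{tr}\n  {td}{code}\n"
                + "some code\n{code}{td}\n  {tr}\n{table}\n";
        System.out.print(unwrapCode(wiki));
        System.out.println(stripCaption(
                "<table border=1>\n<caption>Example 1</caption>\n<tr>"));
    }
}
```

Both opening-side alternatives `\n` and `\s` can match a newline, so this pattern may backtrack heavily on long runs of blank lines; for the short wrappers shown here that is not an issue.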
rev | line source |
---|---|
0 (f8b1ea49d065, smith@nwoca.org: Initial version of crude HTML to WikiText converter. Customized for converting HTML files from DEC Document into Wiki markup.) | 1 package org.nwoca.ssdt.tools.html2wiki; |
0 | 2 /* |
0 | 3  * Html2Wiki.java |
0 | 4  * |
0 | 5  * Created on May 9, 2006, 3:22 PM |
0 | 6  * |
0 | 7  */ |
0 | 8 |
0 | 9 import java.io.*; |
0 | 10 import java.util.Collection; |
0 | 11 import java.util.ArrayList; |
0 | 12 import java.util.List; |
0 | 13 import org.apache.commons.io.FileUtils; |
0 | 14 import java.util.regex.*; |
0 | 15 |
0 | 16 /** |
0 | 17  * Converter to convert HTML documents into MediaWiki text. |
0 | 18  * |
0 | 19  * Heavily customized to handle HTML produced by DEC DOCUMENT |
0 | 20  * SOFTWARE doctype. Breaks file into Chapters in the manner done |
0 | 21  * by Document. Needs modification to work with other HTML files. |
0 | 22  * |
0 | 23  * @author SMITH |
0 | 24  */ |
0 | 25 public class Html2Wiki { |
6 (99f293bd507f, smith@nwoca.org, parent 5: Add "reflow" transformer to reflow paragraphs, list items, etc.) | 26 |
0 | 27     private StringBuffer buffer; |
0 | 28     private Collection<Transformer> transformers; |
0 | 29     private boolean converted = false; |
0 | 30     private static String category; |
6 | 31 |
0 | 32     /** Creates a new instance of Html2Wiki. */ |
0 | 33     public Html2Wiki(String html) { |
0 | 34         buffer = new StringBuffer(html); |
0 | 35         transformers = new ArrayList<Transformer>(); |
0 | 36         transformers.add(new DeleteTransformer("<html>|</html>|<body>|</body>")); |
0 | 37         transformers.add(new DeleteTransformer("<!--.*-->(\\n|\\r)*",true)); |
0 | 38         transformers.add(new DeleteTransformer("<a .*?>|</a>")); |
0 | 39         transformers.add(new DeleteTransformer("(?m)^\\*")); |
0 | 40         transformers.add(new DeleteTransformer("(?m)<br>$")); |
14 | 41         transformers.add(new DeleteTransformer("<caption>.*</caption>")); // remove SDML captions (used for TOC) |
0 | 42         transformers.add(new DeleteTransformer("<font .*?>|</font>")); |
4 (22ed6d93442c, smith@nwoca.org, parent 2: Start modifying transformers to Confluence wiki syntax) | 43         transformers.add(new CloseTagTransformer("<li>","(\n|\r)*(<li>|</ul>|</ol>|<ul>|<ol>)","</li>")); |
0 | 44         transformers.add(new BadTableDataTransformer()); |
0 | 45         transformers.add(new BadTableRowTransformer()); |
6 | 46         transformers.add(new ReflowTransformer()); |
6 | 47         transformers.add(new DeleteTransformer("<p>")); |
8 | 48         transformers.add(new ReplaceTransformer("\\{","\\{")); // Escape braces |
8 | 49         transformers.add(new ReplaceTransformer("\\}","\\}")); |
7 (a634b4d554d4, smith@nwoca.org, parent 6: Minor fixups >, random smilies :), etc. Fixed blockquote. Handle escaping brackets outside pre tag.) | 50 |
7 | 51         transformers.add(new ReplaceTransformer("\\[","\\[")); // Escape brackets |
7 | 52         transformers.add(new ReplaceTransformer("\\]","\\]")); |
7 | 53         transformers.add(new PreTagTransformer()); // Unescape brackets inside <pre> |
7 | 54         // |
4 | 55         transformers.add(new ReplaceTransformer("<br>","\\\\")); |
8 | 56 |
8 | 57         //replace table tag preserving border setting. |
10 | 58         transformers.add(new TagTransformer("<table\\sborder=(\\d).*?>", true, "{table:border=", "|width=75%}")); |
8 | 59 |
4 | 60         transformers.add(new ReplaceTransformer("<table.*?>|</table>","{table}")); |
4 | 61         transformers.add(new ReplaceTransformer("<tr>|</tr>","{tr}")); |
5 | 62         transformers.add(new ReplaceTransformer("<td.*?>|</td>","{td}")); |
5 | 63         transformers.add(new ReplaceTransformer("<th.*?>|</th>","{th}")); |
4 | 64         transformers.add(new ReplaceTransformer("<ol.*?>|</ol>","{ol}")); |
4 | 65         transformers.add(new ReplaceTransformer("<ul.*?>|</ul>","{ul}")); |
4 | 66         transformers.add(new ReplaceTransformer("<li>","{li}")); |
13 | 67         transformers.add(new ReplaceTransformer("\\n\\s*</li>","{li}\n")); // remove leading space from </li> |
13 | 68         transformers.add(new ReplaceTransformer("</li>","{li}\n")); // Replace remaining </li> |
6 | 69 |
0 | 70         transformers.add(new ChapterTransformer(category)); |
4 | 71         transformers.add(new TagTransformer("<pre>(.*?)</pre>", true, "{code}","{code}")); |
4 | 72         transformers.add(new TagTransformer("<center>(.*?)</center>", true, "{center}","{center}")); |
4 | 73         transformers.add(new TagTransformer("<em>(.*?)</em>", "*","*")); |
12 | 74         transformers.add(new TagTransformer("<strong>(.*?)</strong>", true, "*","*")); |
9 | 75         transformers.add(new TagTransformer("<u>(.*?)</u>" , "+","+")); |
4 | 76         transformers.add(new TagTransformer("(?s)<kbd>(.*?)</kbd>", "{{", "}}")); |
4 | 77         transformers.add(new TagTransformer("<h1>(.*)</h1>", "h1. ", "")); |
4 | 78         transformers.add(new TagTransformer("<h2>(.*)</h2>", "h2. ", "")); |
4 | 79         transformers.add(new TagTransformer("<h3>(accessing the program|sample run|sample screens?|sample reports?)</[h|H]3>","h3.", "")); |
4 | 80         transformers.add(new TagTransformer("<h3>(.*)</H3>", "h3. ", "")); |
4 | 81         transformers.add(new TagTransformer("<h3>(.*)</h3>", "h3. ", "")); |
4 | 82         transformers.add(new TagTransformer("<h4>(.*)</h4>", "h4. ", "")); |
4 | 83         transformers.add(new TagTransformer("<h5>(.*)</h5>", "h5. ", "")); |
4 | 84         transformers.add(new TagTransformer("<h6>(.*)</h6>", "h6. ", "")); |
8 | 85 |
8 | 86         //Replace Notes with Info tags. |
10 | 87         transformers.add(new ReplaceTransformer("\\{center}\\n\\{table:border=\\d.*}\\n\\{tr\\}\\n\\s{2}\\{td\\}\\{center\\}\\*Note\\*\\{center\\}","{info}")); |
8 | 88         transformers.add(new ReplaceTransformer("\\{td\\}\\n\\s{2}\\{tr\\}\\n\\{table\\}\\n\\{center\\}","{info}")); |
5 | 89 |
8 | 90         //Remove unnecessary table surrounding code blocks. |
14 | 91         transformers.add(new ReplaceTransformer("\\{table:.*\\}(\\n|\\s|\\{t.\\}|\\*\\S*\\*)*\\{code\\}","{code}")); |
14 | 92         transformers.add(new ReplaceTransformer("\\{code\\}(\\n|\\{t.\\}|\\s)*\\{table\\}","{code}")); |
8 | 93 |
8 | 94         //Change borderStyle of code window for "screenshots" to none. |
8 | 95         transformers.add(new TagTransformer("\\{code\\}([\\s\\n]*?_______________)", true, "{code:borderStyle=none}", "")); |
8 | 96 |
8 | 97 |
8 | 98 |
7 | 99         transformers.add(new TagTransformer("<blockquote>(.*?)</blockquote>", true, "{quote}", "{quote}")); |
0 | 100         transformers.add(new DeleteTransformer("(?s)<hr.*?>")); |
8 | 101         transformers.add(new ReflowTransformer("(\\{info\\})([^\\{]*)(\\{info\\})")); |
12 | 102         transformers.add(new ReflowTransformer("(\\{note\\})([^\\{]*)(\\{note\\})")); |
12 | 103         transformers.add(new ReflowTransformer("(\\{td\\})([^\\{]*)(\\{td\\})")); |
12 | 104         transformers.add(new ReflowTransformer("(\\{li\\})([^\\{]*)(\\{li\\})")); |
7 | 105         transformers.add(new TagTransformer("<sup>(.*?)</sup>", true, "^\\[","\\]^ ")); |
7 | 106         transformers.add(new ReplaceTransformer("&lt;","<")); |
7 | 107         transformers.add(new ReplaceTransformer("&gt;",">")); |
7 | 108         transformers.add(new ReplaceTransformer("&quot;","\"")); |
12 | 109         transformers.add(new ReplaceTransformer("&amp;","&")); |
7 | 110         transformers.add(new ReplaceTransformer(":\\)",": )")); // No smilies... |
13 | 111         transformers.add(new ReplaceTransformer("(\\w)(--)(\\w)"," -- ",2)); // avoid strikeout |
6 | 112 |
0 | 113     } |
6 | 114 |
0 | 115     /** |
0 | 116      * @param args the command line arguments |
0 | 117      */ |
0 | 118     public static void main(String[] args) throws IOException { |
6 | 119 |
0 | 120         if (args.length == 0) { |
0 | 121             System.out.println("Usage:"); |
0 | 122             System.out.println(" Html2Wiki {inputDirectory} [Category]"); |
0 | 123             System.out.println(" default is current directory"); |
0 | 124             System.out.println(" Processes all *.html files. "); |
0 | 125             System.out.println(" Each 'chapter' written to *.wiki"); |
0 | 126             return; |
0 | 127         } |
6 | 128 |
0 | 129         File inputs = new File(args[0]); |
6 | 130 |
0 | 131         if (args.length > 1) { |
0 | 132             category = args[1]; |
0 | 133         } |
6 | 134 |
0 | 135         File[] inputFiles = inputs.listFiles(new HtmlFileFilter()); |
0 | 136         for (int i = 0; i < inputFiles.length; i++) { |
6 | 137 |
0 | 138             process(inputFiles[i]); |
6 | 139 |
0 | 140         } |
6 | 141 |
0 | 142     } |
6 | 143 |
0 | 144     protected static void process(File input) throws IOException { |
6 | 145 |
0 | 146         System.out.println(input.getAbsoluteFile()); |
6 | 147 |
6 | 148         Html2Wiki converter = new Html2Wiki(FileUtils.readFileToString(input, null)); |
6 | 149 |
0 | 150         WikiChapter[] chapters = converter.getWikiChapters(); |
6 | 151 |
6 | 152         System.out.format("Writing %d wiki files...\n", chapters.length); |
0 | 153 |
6 | 154 |
0 | 155         for (int i = 0; i < chapters.length; i++) { |
6 | 156 |
0 | 157             FileUtils.writeStringToFile(new File(input.getParent(), |
6 | 158                     generateFilename(chapters[i].getChapterName()) + ".wiki"), |
6 | 159                     chapters[i].getContents().toString(), |
6 | 160                     null); |
6 | 161 |
0 | 162         } |
6 | 163 |
0 | 164     } |
6 | 165 |
0 | 166     public static String generateFilename(String input) { |
6 | 167         return input.replaceAll("\\\\|/|:|\\(|\\)", "-").replace("<br>", ""); |
6 | 168 |
0 | 169     } |
6 | 170 |
0 | 171     public String getWikiText() { |
0 | 172         convert(); |
0 | 173         return buffer.toString(); |
0 | 174     } |
6 | 175 |
0 | 176     public WikiChapter[] getWikiChapters() { |
6 | 177 |
0 | 178         convert(); |
6 | 179 |
0 | 180         List<WikiChapter> chapters = new ArrayList<WikiChapter>(); |
6 | 181 |
0 | 182         Pattern chapterPat = Pattern.compile("<chapter>"); |
0 | 183         Matcher begin = chapterPat.matcher(buffer); |
0 | 184         Matcher end = chapterPat.matcher(buffer); |
6 | 185 |
6 | 186         while (begin.find()) { |
6 | 187 |
6 | 188 |
0 | 189             end.find(begin.end()); |
6 | 190 |
0 | 191             Pattern chapterNamePat = Pattern.compile("<chapter>(.*?)</chapter>"); |
6 | 192 |
0
f8b1ea49d065
Initial version of crude HTML to WikiText converter. Customized for converting HTML files from DEC Document into Wiki markup.
smith@nwoca.org
parents:
diff
changeset
|
193 Matcher chapterNameMatcher = chapterNamePat.matcher(buffer); |
6
99f293bd507f
Add "reflow" transformer to reflow paragraphs, list items, etc.
smith@nwoca.org
parents:
5
diff
changeset
|
194 |
0
f8b1ea49d065
Initial version of crude HTML to WikiText converter. Customized for converting HTML files from DEC Document into Wiki markup.
smith@nwoca.org
parents:
diff
changeset
|
195 String chapterName = chapterNameMatcher.find(begin.start()) && chapterNameMatcher.start() == begin.start() ? chapterNameMatcher.group(1) : null; |
6
99f293bd507f
Add "reflow" transformer to reflow paragraphs, list items, etc.
smith@nwoca.org
parents:
5
diff
changeset
|
196 |
99f293bd507f
Add "reflow" transformer to reflow paragraphs, list items, etc.
smith@nwoca.org
parents:
5
diff
changeset
|
197 CharSequence contents = buffer.subSequence(chapterName == null ? begin.start() : chapterNameMatcher.end(), end.find(begin.end()) ? end.start() : buffer.length()); |
99f293bd507f
Add "reflow" transformer to reflow paragraphs, list items, etc.
smith@nwoca.org
parents:
5
diff
changeset
|
198 |
99f293bd507f
Add "reflow" transformer to reflow paragraphs, list items, etc.
smith@nwoca.org
parents:
5
diff
changeset
|
199 chapters.add(new WikiChapter(chapterName, contents)); |
99f293bd507f
Add "reflow" transformer to reflow paragraphs, list items, etc.
smith@nwoca.org
parents:
5
diff
changeset
|
200 |
0
f8b1ea49d065
Initial version of crude HTML to WikiText converter. Customized for converting HTML files from DEC Document into Wiki markup.
smith@nwoca.org
parents:
diff
changeset
|
201 } |
6
99f293bd507f
Add "reflow" transformer to reflow paragraphs, list items, etc.
smith@nwoca.org
parents:
5
diff
changeset
|
202 return chapters.toArray(new WikiChapter[0]); |
0
f8b1ea49d065
Initial version of crude HTML to WikiText converter. Customized for converting HTML files from DEC Document into Wiki markup.
smith@nwoca.org
parents:
diff
changeset
|
203 } |
6
99f293bd507f
Add "reflow" transformer to reflow paragraphs, list items, etc.
smith@nwoca.org
parents:
5
diff
changeset
|
204 |
0
f8b1ea49d065
Initial version of crude HTML to WikiText converter. Customized for converting HTML files from DEC Document into Wiki markup.
smith@nwoca.org
parents:
diff
changeset
|
205 private void convert() { |
6
99f293bd507f
Add "reflow" transformer to reflow paragraphs, list items, etc.
smith@nwoca.org
parents:
5
diff
changeset
|
206 |
99f293bd507f
Add "reflow" transformer to reflow paragraphs, list items, etc.
smith@nwoca.org
parents:
5
diff
changeset
|
207 if (!converted) { |
0
f8b1ea49d065
Initial version of crude HTML to WikiText converter. Customized for converting HTML files from DEC Document into Wiki markup.
smith@nwoca.org
parents:
diff
changeset
|
208 for (Transformer t : transformers) { |
6
99f293bd507f
Add "reflow" transformer to reflow paragraphs, list items, etc.
smith@nwoca.org
parents:
5
diff
changeset
|
209 |
0
f8b1ea49d065
Initial version of crude HTML to WikiText converter. Customized for converting HTML files from DEC Document into Wiki markup.
smith@nwoca.org
parents:
diff
changeset
|
210 System.out.println(".Applying: " + t); |
f8b1ea49d065
Initial version of crude HTML to WikiText converter. Customized for converting HTML files from DEC Document into Wiki markup.
smith@nwoca.org
parents:
diff
changeset
|
211 t.apply(buffer); |
6
99f293bd507f
Add "reflow" transformer to reflow paragraphs, list items, etc.
smith@nwoca.org
parents:
5
diff
changeset
|
212 |
0
f8b1ea49d065
Initial version of crude HTML to WikiText converter. Customized for converting HTML files from DEC Document into Wiki markup.
smith@nwoca.org
parents:
diff
changeset
|
213 } |
f8b1ea49d065
Initial version of crude HTML to WikiText converter. Customized for converting HTML files from DEC Document into Wiki markup.
smith@nwoca.org
parents:
diff
changeset
|
214 } |
f8b1ea49d065
Initial version of crude HTML to WikiText converter. Customized for converting HTML files from DEC Document into Wiki markup.
smith@nwoca.org
parents:
diff
changeset
|
215 converted = true; |
f8b1ea49d065
Initial version of crude HTML to WikiText converter. Customized for converting HTML files from DEC Document into Wiki markup.
smith@nwoca.org
parents:
diff
changeset
|
216 } |
6
99f293bd507f
Add "reflow" transformer to reflow paragraphs, list items, etc.
smith@nwoca.org
parents:
5
diff
changeset
|
217 |
0
f8b1ea49d065
Initial version of crude HTML to WikiText converter. Customized for converting HTML files from DEC Document into Wiki markup.
smith@nwoca.org
parents:
diff
changeset
|
218 private static class HtmlFileFilter implements FileFilter { |
6
99f293bd507f
Add "reflow" transformer to reflow paragraphs, list items, etc.
smith@nwoca.org
parents:
5
diff
changeset
|
219 |
0
f8b1ea49d065
Initial version of crude HTML to WikiText converter. Customized for converting HTML files from DEC Document into Wiki markup.
smith@nwoca.org
parents:
diff
changeset
|
220 public boolean accept(File pathname) { |
f8b1ea49d065
Initial version of crude HTML to WikiText converter. Customized for converting HTML files from DEC Document into Wiki markup.
smith@nwoca.org
parents:
diff
changeset
|
221 return pathname.getName().toLowerCase().matches("^.*\\.html$"); |
f8b1ea49d065
Initial version of crude HTML to WikiText converter. Customized for converting HTML files from DEC Document into Wiki markup.
smith@nwoca.org
parents:
diff
changeset
|
222 } |
f8b1ea49d065
Initial version of crude HTML to WikiText converter. Customized for converting HTML files from DEC Document into Wiki markup.
smith@nwoca.org
parents:
diff
changeset
|
223 } |
6
99f293bd507f
Add "reflow" transformer to reflow paragraphs, list items, etc.
smith@nwoca.org
parents:
5
diff
changeset
|
224 |
2
5da2e67620f9
Upgrade to Ivy configuration and begin clean up of tests. Added FreeBSD license.
smith@nwoca.org
parents:
0
diff
changeset
|
225 protected static class WikiChapter { |
6
99f293bd507f
Add "reflow" transformer to reflow paragraphs, list items, etc.
smith@nwoca.org
parents:
5
diff
changeset
|
226 |
0
f8b1ea49d065
Initial version of crude HTML to WikiText converter. Customized for converting HTML files from DEC Document into Wiki markup.
smith@nwoca.org
parents:
diff
changeset
|
227 private String chapterName; |
f8b1ea49d065
Initial version of crude HTML to WikiText converter. Customized for converting HTML files from DEC Document into Wiki markup.
smith@nwoca.org
parents:
diff
changeset
|
228 private CharSequence contents; |
6
99f293bd507f
Add "reflow" transformer to reflow paragraphs, list items, etc.
smith@nwoca.org
parents:
5
diff
changeset
|
229 |
0
f8b1ea49d065
Initial version of crude HTML to WikiText converter. Customized for converting HTML files from DEC Document into Wiki markup.
smith@nwoca.org
parents:
diff
changeset
|
230 public WikiChapter(String chapterName, CharSequence contents) { |
6
99f293bd507f
Add "reflow" transformer to reflow paragraphs, list items, etc.
smith@nwoca.org
parents:
5
diff
changeset
|
231 this.chapterName = chapterName == null ? null : chapterName.replaceAll("\\\\|/|:|\\(|\\)", "-").replaceAll("\\s+", " ").replaceAll("&", "and"); |
99f293bd507f
Add "reflow" transformer to reflow paragraphs, list items, etc.
smith@nwoca.org
parents:
5
diff
changeset
|
232 |
0
f8b1ea49d065
Initial version of crude HTML to WikiText converter. Customized for converting HTML files from DEC Document into Wiki markup.
smith@nwoca.org
parents:
diff
changeset
|
233 this.contents = contents; |
f8b1ea49d065
Initial version of crude HTML to WikiText converter. Customized for converting HTML files from DEC Document into Wiki markup.
smith@nwoca.org
parents:
diff
changeset
|
234 } |
6
99f293bd507f
Add "reflow" transformer to reflow paragraphs, list items, etc.
smith@nwoca.org
parents:
5
diff
changeset
|
235 |
0
f8b1ea49d065
Initial version of crude HTML to WikiText converter. Customized for converting HTML files from DEC Document into Wiki markup.
smith@nwoca.org
parents:
diff
changeset
|
236 public String getChapterName() { |
f8b1ea49d065
Initial version of crude HTML to WikiText converter. Customized for converting HTML files from DEC Document into Wiki markup.
smith@nwoca.org
parents:
diff
changeset
|
237 return chapterName; |
f8b1ea49d065
Initial version of crude HTML to WikiText converter. Customized for converting HTML files from DEC Document into Wiki markup.
smith@nwoca.org
parents:
diff
changeset
|
238 } |
6
99f293bd507f
Add "reflow" transformer to reflow paragraphs, list items, etc.
smith@nwoca.org
parents:
5
diff
changeset
|
239 |
0
f8b1ea49d065
Initial version of crude HTML to WikiText converter. Customized for converting HTML files from DEC Document into Wiki markup.
smith@nwoca.org
parents:
diff
changeset
|
240 public CharSequence getContents() { |
f8b1ea49d065
Initial version of crude HTML to WikiText converter. Customized for converting HTML files from DEC Document into Wiki markup.
smith@nwoca.org
parents:
diff
changeset
|
241 return contents; |
f8b1ea49d065
Initial version of crude HTML to WikiText converter. Customized for converting HTML files from DEC Document into Wiki markup.
smith@nwoca.org
parents:
diff
changeset
|
242 } |
6
99f293bd507f
Add "reflow" transformer to reflow paragraphs, list items, etc.
smith@nwoca.org
parents:
5
diff
changeset
|
243 |
0
f8b1ea49d065
Initial version of crude HTML to WikiText converter. Customized for converting HTML files from DEC Document into Wiki markup.
smith@nwoca.org
parents:
diff
changeset
|
244 public String toString() { |
2
5da2e67620f9
Upgrade to Ivy configuration and begin clean up of tests. Added FreeBSD license.
smith@nwoca.org
parents:
0
diff
changeset
|
245 return "Chapter: " + chapterName + " Content length: " + contents.length(); |
0
f8b1ea49d065
Initial version of crude HTML to WikiText converter. Customized for converting HTML files from DEC Document into Wiki markup.
smith@nwoca.org
parents:
diff
changeset
|
246 } |
f8b1ea49d065
Initial version of crude HTML to WikiText converter. Customized for converting HTML files from DEC Document into Wiki markup.
smith@nwoca.org
parents:
diff
changeset
|
247 } |
f8b1ea49d065
Initial version of crude HTML to WikiText converter. Customized for converting HTML files from DEC Document into Wiki markup.
smith@nwoca.org
parents:
diff
changeset
|
248 } |
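The chapter-splitting loop in `getWikiChapters()` above can be hard to follow in annotated form, so here is a minimal standalone sketch of the same idea. It is not the author's code: the class name `ChapterSplitSketch`, the `split` method, and the `String[]{title, contents}` return shape are assumptions made for illustration. It also differs from the original in one respect: it confines the title match to the current chapter with `Matcher.region` and `lookingAt`, so a later chapter's title is never picked up by an untitled one.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Html2Wiki: splits text on <chapter> markers.
public class ChapterSplitSketch {

    // Same patterns as getWikiChapters() in the listing above.
    private static final Pattern MARKER = Pattern.compile("<chapter>");
    private static final Pattern TITLED = Pattern.compile("<chapter>(.*?)</chapter>");

    /** Returns one {title, contents} pair per marker; title is null if absent. */
    public static List<String[]> split(CharSequence text) {
        List<String[]> chapters = new ArrayList<>();
        Matcher begin = MARKER.matcher(text);
        Matcher next = MARKER.matcher(text);
        Matcher title = TITLED.matcher(text);

        while (begin.find()) {
            // A chapter runs to the next marker, or to the end of the text.
            int stop = next.find(begin.end()) ? next.start() : text.length();

            // Look for a title only at this marker and only inside this chapter.
            title.region(begin.start(), stop);
            boolean hasTitle = title.lookingAt();

            String name = hasTitle ? title.group(1) : null;
            // As in the original, an untitled chapter keeps its marker in the contents.
            int start = hasTitle ? title.end() : begin.start();

            chapters.add(new String[] {name, text.subSequence(start, stop).toString()});
        }
        return chapters;
    }

    public static void main(String[] args) {
        String sample = "intro<chapter>One</chapter>alpha\n<chapter>Two</chapter>beta";
        for (String[] chapter : split(sample)) {
            System.out.println(chapter[0] + " -> " + chapter[1].length() + " chars");
        }
    }
}
```

As in the original loop, any text before the first marker is dropped. Using the boolean result of `find` directly to detect the last chapter avoids relying on `Matcher.hitEnd()`.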