User:Vegard/Wikitext parsing
From Wikipedia, the free encyclopedia
Reasons for using an LL/LR (context-free) parser:
- Parsing efficiency
- Unambiguity
- Uniform parser/grammar can also help greatly for accessing documents with bots/programs (think DOM)
Don't try to capture all of today's wikitext constructs in a formal grammar; that would be counter-productive, and there are bound to be differences anyway. Even if strange things like nesting links within links are possible with the ad-hoc parsing, how many articles actually use them? The new parser should be simple and extensible (compare with today's regex hell).
Wiki parser
Recursive descent parser
/* XXX */
document { section* | paragraph* }
/* Container for elements that rely on separated lines for structure (such as
* lists). */
line { text "\n" }
/* Text may contain mark-up like links and font styles, but only a single
* contiguous line of text (therefore no lists or other elements that span
* multiple lines). */
text { text-plain | text-italic | text-bold }
text-italic { "''" text "''" }
text-bold { "'''" text "'''" }
/* Plain-text may not contain additional markup. Plain-text may contain
* markup that is not to be displayed as markup. Umm. */
text-plain {
/* XXX: Define this. Make sure to include all UTF-8 characters. */
}
section { heading paragraph* }
/* Headings */
heading {
heading-1 | heading-2 | heading-3 |
heading-4 | heading-5 | heading-6
}
heading-1 { "=" text "=" }
heading-2 { "==" text "==" }
heading-3 { "===" text "===" }
heading-4 { "====" text "====" }
heading-5 { "=====" text "=====" }
heading-6 { "======" text "======" }
/* A single paragraph of text. May contain some multi-line constructs like
* lists, but not headings. */
paragraph {
(text | list)+
}
/* Signatures */
signature { signature-name | signature-name-date | signature-date }
signature-name { "~~~" }
signature-name-date { "~~~~" }
signature-date { "~~~~~" }
/* XXX: Match beginning/end of line */
ruler { "----" }
list { list-element+ }
list-element { ("*" | "#" | ":" | ";")+ line }
comment { "<!--" text-plain "-->" }
tag { "<" /* XXX: What to put here? */ ">" }
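
To make the rules above more concrete, here is a minimal Python sketch of a recursive descent parser for a small subset of the grammar: text, text-italic, text-bold and the six heading rules. It is only an illustration under several assumptions: text is treated as a sequence of runs rather than a single alternative, text-plain (left undefined above) is taken to be any characters up to the next pair of apostrophes, and the names Node, ParseError, parse_text, parse_delimited and parse_heading are hypothetical. It does not try to resolve the usual apostrophe ambiguities; unclosed markup is reported as an error.

```python
from dataclasses import dataclass, field


class ParseError(Exception):
    """Raised when markup is opened but never closed."""


@dataclass
class Node:
    kind: str                                      # "plain", "italic", "bold" or "heading"
    value: str = ""                                # text of a "plain" node
    children: list = field(default_factory=list)   # nested nodes
    level: int = 0                                 # 1..6 for "heading" nodes


def parse_text(src, pos=0, stop=None):
    """text: a run of text-plain / text-italic / text-bold, ending at `stop`."""
    nodes = []
    while pos < len(src):
        if stop is not None and src.startswith(stop, pos):
            break                                  # the caller consumes the closing delimiter
        if src.startswith("'''", pos):
            node, pos = parse_delimited(src, pos, "'''", "bold")
        elif src.startswith("''", pos):
            node, pos = parse_delimited(src, pos, "''", "italic")
        else:
            end = src.find("''", pos)              # assumption: plain text runs to the next ''
            end = len(src) if end == -1 else end
            node, pos = Node("plain", value=src[pos:end]), end
        nodes.append(node)
    return nodes, pos


def parse_delimited(src, pos, delim, kind):
    """text-italic / text-bold: delim text delim."""
    start = pos
    children, pos = parse_text(src, pos + len(delim), stop=delim)
    if not src.startswith(delim, pos):
        raise ParseError(f"unclosed {kind} markup opened at offset {start}")
    return Node(kind, children=children), pos + len(delim)


def parse_heading(line):
    """heading-1 .. heading-6; returns None if the line is not a heading."""
    stripped = line.rstrip()
    level = len(stripped) - len(stripped.lstrip("="))
    if not 1 <= level <= 6:
        return None
    if len(stripped) <= 2 * level or not stripped.endswith("=" * level):
        return None
    children, _ = parse_text(stripped[level:-level].strip())
    return Node("heading", children=children, level=level)
```

Each grammar rule maps to one function, which is what keeps this style of parser easy to extend: adding links or signatures would mean adding one more branch in parse_text and one more function.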
Practical implementation
- Don't make many exceptions and special cases. (For example: a closing </nowiki> tag is not required; if it is missing, the rest of the supplied text is treated as nowiki. [1]) Deprecate these obscure features and produce warnings, so that pages in violation can be detected and corrected.
- Allowing HTML was probably always a bad idea. Provide Wikitext replacements.
- Use a bot to validate existing pages with the new parser. Maintain a list of pages that are not valid with the new parser.
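
As a rough sketch of the warning and validation ideas in this list, the following Python checks pages for a <nowiki> tag that is never closed and collects the offending pages. The function names check_nowiki and validate_pages, and the plain dictionary of page titles to wikitext, are assumptions made for illustration; a real bot would fetch pages from the wiki and run the full new parser rather than a single check.

```python
NOWIKI_OPEN = "<nowiki>"
NOWIKI_CLOSE = "</nowiki>"


def check_nowiki(wikitext):
    """Return warnings for <nowiki> tags that have no matching </nowiki>."""
    warnings = []
    pos = 0
    while True:
        start = wikitext.find(NOWIKI_OPEN, pos)
        if start == -1:
            break
        end = wikitext.find(NOWIKI_CLOSE, start + len(NOWIKI_OPEN))
        if end == -1:
            # Report a 1-based line number so the page can be corrected by hand.
            line = wikitext.count("\n", 0, start) + 1
            warnings.append(f"line {line}: <nowiki> is never closed")
            break
        pos = end + len(NOWIKI_CLOSE)
    return warnings


def validate_pages(pages):
    """pages maps title -> wikitext; returns {title: warnings} for invalid pages only."""
    report = {}
    for title, text in pages.items():
        warnings = check_nowiki(text)
        if warnings:
            report[title] = warnings
    return report


if __name__ == "__main__":
    sample = {
        "Good": "a <nowiki>''x''</nowiki> b",
        "Bad": "line one\n<nowiki>rest of page",
    }
    for title, warnings in validate_pages(sample).items():
        print(title, warnings)
```

The keys of the returned dictionary are the list of pages that are not valid, which is the list the bot would maintain.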
See also
- http://bugs.wikimedia.org/show_bug.cgi?id=7
- http://en.wikipedia.org/wiki/Help:Editing
- http://meta.wikimedia.org/wiki/EBNF
- http://meta.wikimedia.org/wiki/MediaWiki_lexer
- http://meta.wikimedia.org/wiki/MediaWiki_flexer
- http://meta.wikimedia.org/wiki/Alternative_parsers
- http://meta.wikimedia.org/wiki/One-pass_parser
- http://www.mediawiki.org/wiki/User:HappyDog/WikiText_parsing
- http://www.mediawiki.org/wiki/Markup_spec

