User:Cmdrjameson/utf8tohtml.py
A small script that blindly reads each input line as UTF-8 (or, failing that, as Windows code page 1252) and converts any non-ASCII characters to HTML entities. Input defaults to stdin; one or more filenames may be given as command-line arguments.
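Requires Python 2.3 or later. The script uses Python 2 syntax and modules (the print statement, htmlentitydefs) and does not run under Python 3.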
#!/usr/bin/env python
from htmlentitydefs import codepoint2name
import fileinput, re

reNonASCII = re.compile(u'[\u0080-\uffff]', re.UNICODE)

def replaceNonASCII(match):
    '''Replace a unicode character with a named XHTML entity if possible,
    and a decimal entity otherwise. Note, we do not concern ourselves
    with escaping 'unsafe' characters such as &, we assume the input
    text is already properly escaped.'''
    c = ord(match.group())
    try:
        return '&%s;' % codepoint2name[c]
    except KeyError:
        return '&#%d;' % c

if __name__ == '__main__':
    for l in fileinput.input():
        try:
            l = l.decode('utf-8')
        except UnicodeDecodeError:
            l = l.decode('windows-1252')
        print reNonASCII.sub(replaceNonASCII, l),
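
For readers on Python 3, the same approach might be sketched roughly as follows. This is not part of the original script: the names NON_ASCII, replace_non_ascii, decode_line and to_entities are made up for illustration, the sketch works on bytes objects rather than on lines from fileinput, and its character class is widened to cover everything outside ASCII (including characters above U+FFFF), which is an assumption about the intended behaviour.

```python
import re
from html.entities import codepoint2name  # Python 3 home of htmlentitydefs

# Hypothetical name. Matches any character outside ASCII; unlike the
# original [\u0080-\uffff] class, this also covers characters above U+FFFF.
NON_ASCII = re.compile('[^\x00-\x7f]')


def replace_non_ascii(match):
    """Return a named entity if one exists, otherwise a decimal entity.

    As in the original, '&' and other 'unsafe' characters are left alone;
    the input is assumed to be properly escaped already.
    """
    c = ord(match.group())
    try:
        return '&%s;' % codepoint2name[c]
    except KeyError:
        return '&#%d;' % c


def decode_line(raw):
    """Decode bytes as UTF-8, falling back to Windows-1252 on failure."""
    try:
        return raw.decode('utf-8')
    except UnicodeDecodeError:
        return raw.decode('windows-1252')


def to_entities(raw):
    """Convert one line of bytes to ASCII-only text with HTML entities."""
    return NON_ASCII.sub(replace_non_ascii, decode_line(raw))
```

With these definitions, to_entities('café'.encode('utf-8')) and to_entities(b'caf\xe9') both return 'caf&eacute;': the first decodes as UTF-8, while the second is not valid UTF-8 and falls back to Windows-1252. A character with no named entity, such as U+0101, comes out as the decimal entity '&#257;'.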

