Mallet script

I've been converting my blog over to this GitHub approach today. I found a
problem when I copied text from an HTML page - you get nasty characters that
look like this in the editor:


That's instead of the word "I've".

There's something called 7-bit ASCII, which covers only the characters you can
see on the keyboard. Web pages often use characters beyond it, such as tilted
quotation marks rather than the vertical ones on your keyboard, and UTF-8
stores each of those as a multi-byte sequence. I've knocked up a tool to hammer
those characters into keyboard-viewable characters.
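
To make the problem concrete, here is a small sketch of what seems to be going on. It assumes the page was UTF-8 and that the editor decoded it as Windows-1252; that second part is my guess, and other editors will show different junk. The names `curly`, `encoded` and `mangled` are just for illustration.

```python
# U+2019 is the right single quotation mark, the "tilted" apostrophe.
curly = u"I\u2019ve"

# In UTF-8 the apostrophe is stored as three bytes: e2 80 99.
encoded = curly.encode("utf-8")
print(repr(encoded))

# Assumption: the editor reads one byte as one character (Windows-1252),
# so those three bytes show up as three odd characters.
mangled = encoded.decode("cp1252")
print("%d characters became %d" % (len(curly), len(mangled)))
```

The three bytes printed here are the same ones that appear in the second entry of SWAPS.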


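import sys

# Each entry maps a UTF-8 byte sequence to its plain ASCII replacement.
SWAPS = [ (b'\xe2\x80\x98', b"'")   # left single quote
        , (b'\xe2\x80\x99', b"'")   # right single quote

        , (b'\xe2\x80\x9c', b'"')   # left double quote
        , (b'\xe2\x80\x9d', b'"')   # right double quote

        # emdash
        , (b'\xe2\x80\x94', b'--')
        ]


def read_file(fname):
    # Read raw bytes so the sequences in SWAPS match exactly.
    with open(fname, 'rb') as f_ptr:
        data = f_ptr.read()
    print("Read %s" % fname)
    return data

def write_file(fname, data):
    # Overwrite the original file with the cleaned-up bytes.
    with open(fname, 'wb') as f_ptr:
        f_ptr.write(data)
    print("Wrote %s" % fname)

def main():
    # Every file named on the command line is rewritten in place.
    for fname in sys.argv[1:]:
        data = read_file(fname)
        for old, new in SWAPS:
            data = data.replace(old, new)
        write_file(fname, data)

if __name__ == '__main__':
    main()
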
When you run into this, use "hexdump -C filename | less" to find the spot
where the magic code appears in your text, then make a new entry in the SWAPS
variable above.
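
If paging through hexdump output gets tedious, something like the following could list the offending bytes for you. This is only a sketch, not part of the script above: `find_non_ascii` is a name I made up, and it simply reports every run of bytes above 0x7f, so two special characters sitting side by side come back as one run.

```python
import re
import sys

def find_non_ascii(data):
    """Return (offset, byte_run) pairs for each run of bytes above 0x7f."""
    return [(m.start(), m.group())
            for m in re.finditer(b'[\x80-\xff]+', data)]

def main():
    # Report the non-ASCII runs in every file named on the command line.
    for fname in sys.argv[1:]:
        with open(fname, 'rb') as f_ptr:
            data = f_ptr.read()
        for offset, run in find_non_ascii(data):
            print("%s: offset %d: %r" % (fname, offset, run))

if __name__ == '__main__':
    main()
```

Each run it prints can be pasted into a new SWAPS entry. For example, the single-character ellipsis shows up as the bytes e2 80 a6, which you might swap for three plain dots.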