Souci d'encodage : remplacer un mot français accentuéRésolu

Question

Bonjour,
J'utilise un script python pour remplacer un mot par un autre dans un texte (avec bash). Il fonctionne bien avec des mots anglais, mais pas avec des mots français accentués...

Voici le script :
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import sys
import re
from_str,to_str=sys.argv[1:3]
contents=sys.stdin.read()
pat_str=r'(?<!-)\b%s\b(?=[^-]|\Z)'%(from_str)
pat=re.compile(pat_str,re.IGNORECASE)
result=pat.sub(to_str,contents)
print(result)

Ensuite avec bash, les exemples (tordus mais prévoyants) suivants donnent :

$ echo "green-this this this this-green" | erg-replace.py "this" "yellow"
green-this yellow yellow this-green

En anglais le mot non composé "this" est donc correctement remplacé.

$ echo rouge-ça ça ça ça-vert | erg-replace.py ça jaune
rouge-ça ça ça ça-vert

En français, le caractère "ç" ne passe pas...

Merci de l'aide,
Thibaud.

thulin · Accepted Answer

OK, la solution, c'est d'utiliser python 3...

thulin · Answer

Cf. aussi cette discussion (en) : http://ubuntuforums.org/showpost.php?p=7791724&postcount=19 pour le même script en perl

cs_Dobel · Answer

C'est quand même faisable avec du python <= 2.6 malgré son charset par défaut

#!/usr/bin/python2.6
# -*- coding: utf-8 -*-

import sys
import re
import locale

from_str, to_str = sys.argv[1:3]

# get encoding from the current locale
loc = locale.getdefaultlocale()
if loc[1]:
encoding = loc[1]
else:
encoding = "utf-8"

contents = sys.stdin.read().decode(encoding)
to_str = to_str.decode(encoding)
from_str = from_str.decode(encoding)
pattern = r"(?<!-)\b%s\b(?=[^-]|\Z)" % from_str

regexp = re.compile(pattern, re.IGNORECASE | re.UNICODE)

print regexp.sub(to_str, contents)


$ echo -n "rouge-ça ça ça ça-vert" | ./toto.py ça jaune
rouge-ça jaune jaune ça-vert
$ echo -n "ß ß-" | ./toto.py ß Æ
Æ ß-
$

Souci d'encodage : remplacer un mot français accentué

3 réponses

Votre réponse

Discussions similaires