Coding With Fun

What's the best way to remove accents in unicode?


Asked by Magnus Krueger on Dec 14, 2021 FAQ



The best solution would probably be to explicitly remove the Unicode characters that are tagged as diacritics. unicodedata.combining(c) returns the canonical combining class of the character c as an integer; the value is nonzero if c combines with the preceding character, which mainly means it is a diacritic. Edit 2: remove_accents expects a Unicode string, not a byte string.
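As a quick check of what unicodedata.combining reports, here is a minimal sketch, assuming Python 3 (where str is always Unicode text). The sample characters and variable names are illustrative choices, not part of the original answer.

```python
import unicodedata

# combining() returns the canonical combining class as an int:
# 0 for ordinary characters, nonzero for combining marks.
base_class = unicodedata.combining("e")         # 0
acute_class = unicodedata.combining("\u0301")   # 230 (COMBINING ACUTE ACCENT)

# NFKD splits a precomposed letter into base letter + combining mark.
decomposed = unicodedata.normalize("NFKD", "\u00e9")   # "e" followed by U+0301

# In Python 3, a byte string has to be decoded before it can be normalized.
raw = b"caf\xc3\xa9"          # UTF-8 bytes for "café"
text = raw.decode("utf-8")    # now a str, safe to pass to unicodedata
```

The decode step is what the "Edit 2" note amounts to in Python 3; the encoding is assumed to be UTF-8 here and may differ for your data.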
Besides, the approach can be written as a short function:

    import unicodedata

    def remove_accents(input_str):
        # Decompose each character into a base character plus combining marks.
        nfkd_form = unicodedata.normalize('NFKD', input_str)
        # Keep only the characters that are not combining marks.
        return "".join(c for c in nfkd_form if not unicodedata.combining(c))
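To see how the function behaves, here is a small usage sketch, assuming Python 3. The function is repeated so the block runs on its own; the sample strings and the word list are made up for illustration.

```python
import unicodedata

def remove_accents(input_str):
    nfkd_form = unicodedata.normalize("NFKD", input_str)
    return "".join(c for c in nfkd_form if not unicodedata.combining(c))

stripped = remove_accents("Cr\u00e8me br\u00fbl\u00e9e")   # "Crème brûlée"

# NFKD is a compatibility decomposition, so it also expands
# characters such as the "fi" ligature (U+FB01).
ligature = remove_accents("\ufb01n")

# Hypothetical word list: "\u00e9clair" is "éclair" with a precomposed é.
words = ["zebra", "\u00e9clair", "apple"]
by_codepoint = sorted(words)                        # é sorts after z
by_base_letter = sorted(words, key=remove_accents)  # é sorts like e
```

Using the function only as a sort key leaves the original strings untouched, which avoids changing the text itself. If the ligature expansion is unwanted, NFD could be tried in place of NFKD; that is a variation, not what the answer above uses.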
Similarly, the email exchange reflects what I wrote above: because the letter "ł" is not an accented letter (and is not treated as one in the Unicode standard), it does not have a decomposition. – alexis
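The point about "ł" can be checked directly with unicodedata.decomposition. The sketch below assumes Python 3; the EXTRA_FOLDS table is a hypothetical fallback added for illustration and is not something the comment proposes.

```python
import unicodedata

# decomposition() returns the mapping as a string of code points,
# or an empty string when the character has no decomposition.
e_acute = unicodedata.decomposition("\u00e9")    # é -> "0065 0301"
l_stroke = unicodedata.decomposition("\u0142")   # ł -> ""

# Normalization therefore leaves ł exactly as it was.
unchanged = unicodedata.normalize("NFKD", "\u0142")

# Hypothetical fallback: an explicit table for letters you still want to fold.
EXTRA_FOLDS = str.maketrans({"\u0142": "l", "\u0141": "L"})
folded = "\u0141\u00f3d\u017a".translate(EXTRA_FOLDS)   # "Łódź" -> "Lódź"
```

Whether folding "ł" to "l" is appropriate depends on the application, since the two are distinct letters.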
Furthermore,
There are still special letters that this does not handle, such as turned and inverted letters, since their Unicode names do not contain 'WITH'. It depends on what you want to do anyway. I have sometimes needed accent stripping to achieve dictionary sort order.
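The remark about 'WITH' appears to refer to stripping based on Unicode character names. A minimal sketch of that idea, assuming Python 3, might look like the following; strip_by_name is a hypothetical helper written for illustration, not code from the original answer.

```python
import unicodedata

def strip_by_name(char):
    """Return the base letter named before ' WITH ' in the character's Unicode name."""
    name = unicodedata.name(char, "")
    base_name, sep, _ = name.partition(" WITH ")
    if not sep:
        return char  # no 'WITH' in the name, so there is nothing to strip
    try:
        return unicodedata.lookup(base_name)
    except KeyError:
        return char  # the part before 'WITH' is not itself a character name

l_stroke = strip_by_name("\u0142")   # LATIN SMALL LETTER L WITH STROKE
turned_a = strip_by_name("\u0250")   # LATIN SMALL LETTER TURNED A
```

Here "ł" is reduced to "l" even though it has no decomposition, while the turned letter is returned unchanged because its name has no 'WITH' part.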
Indeed,
The character category "Mn" stands for Nonspacing_Mark, which is similar to unicodedata.combining in MiniQuark's answer (I didn't think of unicodedata.combining, but it is probably the better solution, because it's more explicit). Keep in mind that these manipulations may significantly alter the meaning of the text.
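For comparison, a category-based filter could be sketched as below, assuming Python 3. The function name is hypothetical, and the choice of NFKD is an assumption made to match the other function on this page; the answer being quoted may have used a different normalization form.

```python
import unicodedata

def remove_nonspacing_marks(text):
    # Decompose first so that accents become separate characters.
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop every character whose general category is Mn (Nonspacing_Mark).
    return "".join(c for c in decomposed if unicodedata.category(c) != "Mn")

acute_category = unicodedata.category("\u0301")   # "Mn"
acute_class = unicodedata.combining("\u0301")     # 230
stripped = remove_nonspacing_marks("na\u00efve r\u00e9sum\u00e9")   # "naïve résumé"
```

The general category and the combining class are separate character properties, so the two filters agree on common accents but are not guaranteed to agree on every character.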