Replace diacritic characters with "equivalent" ASCII in PHP?


7 Answers


As the title says, I'm looking for a reliable, robust way to reduce any Unicode character to a near-equivalent ASCII character using PHP. I really want to avoid rolling my own lookup table.

The main hassle with iconv is that you have to watch your encodings, but it's definitely the right tool for the job. (I used 'Windows-1252' for the example due to limitations of the text editor I was working with.) The feature of iconv that you definitely want to use is the //TRANSLIT flag, which tells iconv to transliterate any characters that don't have an ASCII match into the closest approximation.

Comment: If the character is not found, a "?" replaces that special char. This should not be the most voted answer. It's misleading. – machineaddict Apr 27 '16 at 12:26

The iconv module can do this, more specifically, the iconv() function:

$str = iconv('Windows-1252', 'ASCII//TRANSLIT//IGNORE', "Gracišce");
echo $str;
//outputs "Gracisce"
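For input that is already UTF-8, the same call can be wrapped as below. This is a minimal sketch, not the answer's own code: to_ascii() is a hypothetical helper name, and it assumes the platform's iconv supports //TRANSLIT. The transliteration result depends on the iconv implementation and on the current locale, which is why some setups produce "?" as the comment above describes.

```php
<?php
// Sketch only: to_ascii() is a hypothetical helper name, not part of PHP.
function to_ascii(string $utf8): string
{
    // //TRANSLIT asks iconv for the closest ASCII approximation;
    // //IGNORE asks it to drop characters that cannot be converted.
    // The @ suppresses the notice that some iconv builds raise on failure.
    $ascii = @iconv('UTF-8', 'ASCII//TRANSLIT//IGNORE', $utf8);

    if ($ascii === false) {
        // Fallback when iconv cannot convert: drop every non-ASCII byte.
        $ascii = (string) preg_replace('/[^\x00-\x7F]/', '', $utf8);
    }

    return $ascii;
}

// The exact output depends on the iconv build and the locale in use.
echo to_ascii("Gracišce"), "\n";
```

Whatever the platform does with individual characters, the return value contains only ASCII bytes, so it is safe to use in slugs or search keys.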

ctype_alpha() checks if all of the characters in the provided string, text, are alphabetic. In the standard C locale letters are just [A-Za-z] and ctype_alpha() is equivalent to (ctype_upper($text) || ctype_lower($text)) if $text is just a single character, but other languages have letters that are considered neither upper nor lower case. It returns true if every character in text is a letter from the current locale, false otherwise. See also ctype_lower(), which checks for lowercase character(s).

Example #1, a ctype_alpha() example (using the default locale), prints:

The string KjgWZC consists of all letters.
The string arf12 does not consist of all letters.
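
The script that produced this output is not included in the excerpt. A sketch that prints the same two lines, assuming the default C locale, could look like this (describe_string() is a hypothetical helper name):

```php
<?php
// Hypothetical helper: reports whether a string is purely alphabetic.
function describe_string(string $text): string
{
    if (ctype_alpha($text)) {
        return "The string $text consists of all letters.";
    }
    return "The string $text does not consist of all letters.";
}

foreach (['KjgWZC', 'arf12'] as $testcase) {
    echo describe_string($testcase), "\n";
}
```

Note that ctype_alpha() works on bytes, so in the C locale an accented letter encoded as UTF-8 is not counted as alphabetic.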

The Consortium has the ambitious goal of eventually replacing existing character encoding schemes with Unicode and its standard Unicode Transformation Format (UTF) schemes, as many of the existing schemes are limited in size and scope and are incompatible with multilingual environments.

The Unicode Character Database is a text document listing the names, code points and properties of all Unicode characters.

Unicode partially addresses the newline problem that occurs when trying to read a text file on different platforms. Unicode defines a large number of characters that conforming applications should recognize as line terminators.

The UCS-2 and UTF-16 encodings specify the Unicode Byte Order Mark (BOM) for use at the beginnings of text files, which may be used for byte ordering detection (or byte endianness detection). The BOM, code point U+FEFF, has the important property of unambiguity on byte reorder, regardless of the Unicode encoding used; U+FFFE (the result of byte-swapping U+FEFF) does not equate to a legal character, and U+FEFF in places other than the beginning of text conveys the zero-width non-break space (a character with no appearance and no effect other than preventing the formation of ligatures).

The same character converted to UTF-8 becomes the byte sequence EF BB BF. The Unicode Standard allows that the BOM "can serve as signature for UTF-8 encoded text where the character set is unmarked".[71] Some software developers have adopted it for other encodings, including UTF-8, in an attempt to distinguish UTF-8 from local 8-bit code pages. However RFC 3629, the UTF-8 standard, recommends that byte order marks be forbidden in protocols using UTF-8, but discusses the cases where this may not be possible. In addition, the large restriction on possible patterns in UTF-8 (for instance there cannot be any lone bytes with the high bit set) means that it should be possible to distinguish UTF-8 from other character encodings without relying on the BOM.

EF BB BF
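
In PHP, a leading UTF-8 BOM is just these three bytes at the start of the string. The following sketch removes it before further processing; strip_utf8_bom() is a hypothetical helper name, and the code assumes the input is a byte string read from a file or request.

```php
<?php
// Hypothetical helper: removes a leading UTF-8 byte order mark, if present.
function strip_utf8_bom(string $text): string
{
    $bom = "\xEF\xBB\xBF"; // U+FEFF encoded as UTF-8

    // Compare only the first three bytes; a BOM elsewhere is left untouched.
    if (strncmp($text, $bom, 3) === 0) {
        return substr($text, 3);
    }

    return $text;
}
```

Stripping the BOM early avoids it leaking into output, where it can appear as stray characters or break header() calls.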

As a French developer, I often come across non-ASCII characters in user-input data. In order to generate clean, search-friendly equivalents, I created the following function that removes the accents while preserving the string integrity.

$str = "À l'île, en été, quelle félicité !";

echo accent2ascii($str); // A l'ile, en ete, quelle felicite
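
The body of accent2ascii() is not shown in this excerpt. One common way to get a similar effect, sketched here under the assumption that the intl extension is loaded and the input is valid UTF-8, is to decompose each character and then drop the combining marks. strip_accents() is a hypothetical name and is not the author's function.

```php
<?php
// Sketch only: strip_accents() is a hypothetical helper, not accent2ascii().
// Assumes the intl extension (Normalizer class) is available.
function strip_accents(string $utf8): string
{
    // NFD splits "é" into "e" followed by U+0301 (combining acute accent).
    $decomposed = Normalizer::normalize($utf8, Normalizer::FORM_D);

    if ($decomposed === false) {
        // Invalid UTF-8 input: return it unchanged rather than guess.
        return $utf8;
    }

    // Remove all nonspacing combining marks (Unicode category Mn).
    return (string) preg_replace('/\p{Mn}+/u', '', $decomposed);
}

echo strip_accents("À l'île, en été, quelle félicité !"), "\n";
```

This approach only handles letters that decompose into a base letter plus marks. Characters such as "ø", "ß" or "æ" have no such decomposition and pass through unchanged, so they still need a lookup table or transliteration.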

This guide covers the following steps: set UTF-8 as the character set for all headers output by your PHP code; specify UTF-8 as the character set for all HTML content; set UTF-8 as the default character set for all MySQL connections; convert double-encoded UTF-8 characters to proper UTF-8 characters.

The first thing you need to do is to modify your php.ini file to use UTF-8 as the default character set:

	default_charset = "UTF-8"

In every PHP output header, specify UTF-8 as the encoding:

  header('Content-Type: text/html; charset=utf-8');

Specify UTF-8 as the encoding type for XML

  <?xml version="1.0" encoding="UTF-8"?>

Since not all UTF-8 characters are accepted in an XML document, you’ll need to strip any such characters out from any XML that you generate. A useful function for doing this (which I found here) is the following:

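  function utf8_for_xml($string) {
     // Replace every run of characters outside the XML 1.0 "Char" range with a space.
     // The supplementary planes (U+10000 to U+10FFFF) are valid in XML and are kept.
     return preg_replace(
        '/[^\x{0009}\x{000a}\x{000d}\x{0020}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]+/u',
        ' ', $string);
  }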

Here’s how you can use this function in your code:

  $safeString = utf8_for_xml($yourUnsafeString);

For HTML content, specify UTF-8 as the encoding:

  <meta http-equiv="Content-Type" content="text/html; charset=utf-8">

In HTML forms, specify UTF-8 as the encoding:

  <form accept-charset="utf-8">

When escaping output for HTML, pass UTF-8 as the encoding argument, e.g.:

  htmlspecialchars($str, ENT_NOQUOTES, "UTF-8")

Specify UTF-8 as the default character set to use when exchanging data with the MySQL database using mysql_set_charset:

  $link = mysql_connect('localhost', 'user', 'password');
  mysql_set_charset('utf8', $link);

Note that, as of PHP 5.5.0, mysql_set_charset is deprecated (the whole mysql extension was removed in PHP 7.0.0), and mysqli::set_charset should be used instead:

  $mysqli = new mysqli("localhost", "my_user", "my_password", "test");

  /* check connection */
  if (mysqli_connect_errno()) {
     printf("Connect failed: %s\n", mysqli_connect_error());
     exit();
  }

  /* change character set to utf8 */
  if (!$mysqli->set_charset("utf8")) {
     printf("Error loading character set utf8: %s\n", $mysqli->error);
  } else {
     printf("Current character set: %s\n", $mysqli->character_set_name());
  }

  $mysqli->close();

Set the following config parameters after each corresponding tag (MySQL spells the character set name utf8, not UTF-8):

  [client]
  default-character-set=utf8

  [mysql]
  default-character-set=utf8

  [mysqld]
  character-set-client-handshake = false #force encoding to utf8
  character-set-server=utf8
  collation-server=utf8_general_ci

  [mysqld_safe]
  default-character-set=utf8

To verify that everything has properly been set to use the UTF-8 encoding, execute the following query:

  mysql> show variables like 'char%';

The output should look something like:

  | character_set_client     | utf8                       |
  | character_set_connection | utf8                       |
  | character_set_database   | utf8                       |
  | character_set_filesystem | binary                     |
  | character_set_results    | utf8                       |
  | character_set_server     | utf8                       |
  | character_set_system     | utf8                       |
  | character_sets_dir       | /usr/share/mysql/charsets/ |

If the connecting client has no way to specify the encoding for its communication with MySQL, after the connection is established you may have to run the following command/query:

  set names utf8;

Set your index definition to have:

charset_type = utf-8

Add the following to your source definition:

sql_query_pre = SET CHARACTER_SET_RESULTS=utf8
sql_query_pre = SET NAMES utf8

Execute the following command:

 ALTER SCHEMA `your-db-name`
 DEFAULT CHARACTER SET utf8;

Via command line, verify that everything is properly set to UTF-8

 mysql> show variables like 'char%';

Create a dump file with latin1 encoding for the table you want to convert:

 mysqldump -u USERNAME -pDB_PASSWORD --opt --skip-set-charset \
    --default-character-set=latin1 --skip-extended-insert \
    DATABASENAME --tables TABLENAME > DUMP_FILE_TABLE.sql

e.g:

 mysqldump -u root --opt --skip-set-charset \
    --default-character-set=latin1 --skip-extended-insert \
    artists-database --tables tbl_artist > tbl_artist.sql

Replace the latin1 charset declaration in the dump file with utf8, e.g., using Perl:

 perl -i -pe 's/DEFAULT CHARSET=latin1/DEFAULT CHARSET=utf8/' DUMP_FILE_TABLE.sql

From this point, we will start changing the database data, so it would be prudent to back up the database if you haven’t already done so. Then, restore the dump into the database:

 mysql> source "DUMP_FILE_TABLE.sql";

See if there are any records with multi-byte characters (if this query returns zero, then there don’t appear to be any records with multi-byte characters in your table and you can proceed to Step 8).

  mysql> select count(*) from MY_TABLE where LENGTH(MY_FIELD) != CHAR_LENGTH(MY_FIELD);
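
The query works because LENGTH() counts bytes while CHAR_LENGTH() counts characters, so the two differ only when a value contains multi-byte characters. The same check can be sketched in PHP, assuming the mbstring extension is loaded and the string is UTF-8 (has_multibyte() is a hypothetical helper name):

```php
<?php
// Hypothetical helper mirroring LENGTH(x) != CHAR_LENGTH(x) in MySQL.
// Assumes the mbstring extension and UTF-8 input.
function has_multibyte(string $utf8): bool
{
    // strlen() counts bytes; mb_strlen() counts characters.
    return strlen($utf8) !== mb_strlen($utf8, 'UTF-8');
}

var_dump(has_multibyte("cafe"));        // pure ASCII
var_dump(has_multibyte("caf\u{00E9}")); // contains "é", two bytes in UTF-8
```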

Copy rows with multi-byte characters into a temporary table:

  create table temptable(
     select * from MY_TABLE where LENGTH(MY_FIELD) != CHAR_LENGTH(MY_FIELD));

e.g.:

  alter table temptable modify temptable.ArtistName varchar(128) character set latin1;

e.g.:

  alter table temptable modify temptable.ArtistName blob;
  alter table temptable modify temptable.ArtistName varchar(128) character set utf8;
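
The latin1, blob, utf8 sequence above undoes double encoding: converting the column to latin1 turns each doubly encoded character back into its original UTF-8 bytes, and the blob step lets MySQL relabel those bytes as utf8 without converting them again. The same idea can be sketched in PHP for a single string. This is an illustration under assumptions, not part of the original procedure: fix_double_encoded() is a hypothetical name, the mbstring extension is assumed, and ISO-8859-1 stands in for MySQL's latin1.

```php
<?php
// Sketch only: fix_double_encoded() is a hypothetical helper.
// Assumes the mbstring extension; ISO-8859-1 approximates MySQL's latin1.
function fix_double_encoded(string $s): string
{
    // Undo one layer of encoding: UTF-8 text -> single bytes.
    $decoded = mb_convert_encoding($s, 'ISO-8859-1', 'UTF-8');

    // Accept the result only if the conversion was lossless ...
    $lossless = mb_convert_encoding($decoded, 'UTF-8', 'ISO-8859-1') === $s;

    // ... and the recovered bytes are themselves valid UTF-8.
    if ($lossless && mb_check_encoding($decoded, 'UTF-8')) {
        return $decoded;
    }

    // Otherwise the string was not double encoded; leave it alone.
    return $s;
}
```

A string that legitimately contains a sequence such as "Ã©" cannot be told apart from a doubly encoded "é", so a check like this should be applied only to data known to be affected.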

Remove rows with only single-byte characters from the temporary table:

  delete from temptable where LENGTH(MY_FIELD) = CHAR_LENGTH(MY_FIELD);

Re-insert fixed rows back into the original table (before doing this, you may want to run some selects on the temptable to verify that it appears to be properly corrected, just as a sanity check).

  replace into MY_TABLE(select * from temptable);

Character codes and names. The notations used are: OCTL, the C/Modula-3 octal notation; HEX, the hexadecimal code, as used e.g. in MIME quoted-printable encoding; HTML, the HTML notation (decimal).

      ! " # $ % & ' ( ) * + , - . /
      0 1 2 3 4 5 6 7 8 9 : ; < = > ? @
      A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
      [ \ ] ^ _ `
      a b c d e f g h i j k l m n o p q r s t u v w x y z
      { | } ~

      ¡ ¢ £ ¤ ¥ ¦ § ¨ © ª « ¬ ­ ® ¯
      ° ± ² ³ ´ µ ¶ · ¸ ¹ º » ¼ ½ ¾ ¿

      À Á Â Ã Ä Å Æ Ç È É Ê Ë Ì Í Î Ï Ð Ñ Ò Ó Ô Õ Ö
      ×
      Ø Ù Ú Û Ü Ý Þ
      ß
      à á â ã ä å æ ç è é ê ë ì í î ï ð ñ ò ó ô õ ö
      ÷
      ø ù ú û ü ý þ
      ÿ

The normalize() method returns the Unicode Normalization Form of the string. Its return value is a string containing the Unicode Normalization Form of the given string. See Unicode Standard Annex #15, Unicode Normalization Forms.

In Unicode, two sequences of code points are compatible if they represent the same abstract characters, and should be treated alike in some (but not necessarily all) applications.

normalize()
normalize(form)
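
The normalize() syntax above is JavaScript's. In PHP, the closest counterpart is the Normalizer class from the intl extension. The sketch below assumes intl is loaded and shows that two different code point sequences for the same letter only compare equal after normalization:

```php
<?php
// Assumes the intl extension (Normalizer class) is available.
$composed   = "\u{00F1}";  // "ñ" as one code point
$decomposed = "n\u{0303}"; // "n" followed by a combining tilde

// The byte sequences differ, so a plain comparison fails.
var_dump($composed === $decomposed);

// After normalizing both to the same form (NFC here), they match.
var_dump(
    Normalizer::normalize($composed, Normalizer::FORM_C)
        === Normalizer::normalize($decomposed, Normalizer::FORM_C)
);
```

Normalizing to a decomposed form (Normalizer::FORM_D) is also the first step of the accent-stripping approach, since it separates base letters from their combining marks.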