Unicode data is never ambiguous.

Of course, there is still the matter of all these legacy encoding systems. 7-bit ASCII, for instance, stores English characters as numbers ranging from 0 to 127. (65 is capital "A", 97 is lowercase "a", and so forth.) English has a very simple alphabet, so it can be completely expressed in 7-bit ASCII. Western European languages like French, Spanish, and German all use an encoding system called ISO-8859-1 (also called "latin-1"), which uses the 7-bit ASCII characters for the numbers 0 through 127, but then extends into the 128-255 range for characters like n-with-a-tilde-over-it (241) and u-with-two-dots-over-it (252). And unicode uses the same characters as 7-bit ASCII for 0 through 127, and the same characters as ISO-8859-1 for 128 through 255, and then extends from there into characters for other languages with the remaining numbers, 256 through 65535.

When dealing with unicode data, you may at some point need to convert the data back into one of these other legacy encoding systems. For instance, you might need to integrate with some other computer system which expects its data in a specific 1-byte encoding scheme, print the data to a non-unicode-aware terminal or printer, or store it in an XML document which explicitly specifies the encoding scheme.

And on that note, let's get back to Python.

Python has had unicode support throughout the language since version 2.0. The XML package uses unicode to store all parsed XML data, but you can use unicode anywhere.

Example 9.13. Introducing unicode
>>> s = u'Dive in'
>>> s
u'Dive in'
>>> print s
Dive in

To create a unicode string instead of a regular ASCII string, add the letter "u" before the string. Note that this particular string doesn't have any non-ASCII characters. That's fine; unicode is a superset of ASCII (a very large superset at that), so any regular ASCII string can also be stored as unicode.

When printing a string, Python will attempt to convert it to your default encoding, which is usually ASCII. (More on this in
a minute.) Since this unicode string is made up of characters that are also ASCII characters, printing it has the same result as printing a normal ASCII string; the conversion is seamless, and if you didn't know that s was a unicode string, you'd never notice the difference.

Example 9.14. Storing non-ASCII characters
>>> s = u'La Pe\xf1a'
>>> print s
Traceback (innermost last):
File "", line 1, in ?
UnicodeError: ASCII encoding error: ordinal not in range(128)
>>> print s.encode('latin-1')
La Peña

Dive Into Python 126

The real advantage of unicode, of course, is its ability to store non-ASCII characters, like the Spanish "ñ" (n with a tilde over it). The unicode character code for the tilde-n is 0xf1 in hexadecimal (241 in decimal), which you can type like this: \xf1.

Remember I said that the print statement attempts to convert a unicode string to ASCII so it can print it? Well, that's not going to work here, because your unicode string contains non-ASCII characters, so Python raises a UnicodeError.

Here's where the conversion-from-unicode-to-other-encoding-schemes comes in. s is a unicode string, but print can only print a regular string. To solve this problem, you call the encode method, available on every unicode string, to convert the unicode string to a regular string in the given encoding scheme, which you pass as a parameter. In this case, you're using latin-1 (also known as iso-8859-1), which includes the tilde-n (whereas the default ASCII encoding scheme did not, since it only includes characters numbered 0 through 127).

Remember I said Python usually converted unicode to ASCII whenever it needed to make a regular string out of a unicode string?
Well, this default encoding scheme is an option which you can customize.

Example 9.15. sitecustomize.py
# sitecustomize.py
# this file can be anywhere in your Python path,
# but it usually goes in ${pythondir}/lib/site-packages/
import sys
sys.setdefaultencoding('iso-8859-1')

sitecustomize.py is a special script; Python will try to import it on startup, so any code in it will be run automatically. As the comment mentions, it can go anywhere (as long as import can find it), but it usually goes in the site-packages directory within your Python lib directory.

The setdefaultencoding function sets, well, the default encoding. This is the encoding scheme that Python will try to use whenever it needs to auto-coerce a unicode string into a regular string.

Example 9.16. Effects of setting the default encoding
>>> import sys
>>> sys.getdefaultencoding()
'iso-8859-1'
>>> s = u'La Pe\xf1a'
>>> print s
La Peña

This example assumes that you have made the changes listed in the previous example to your sitecustomize.py file, and restarted Python. If your default encoding still says 'ascii', you didn't set up your sitecustomize.py properly, or you didn't restart Python. The default encoding can only be changed during Python startup; you can't change it later. (Due to some wacky programming tricks that I won't get into right now, you can't even call sys.setdefaultencoding after Python has started up. Dig into site.py and search for "setdefaultencoding" to find out how.)

Now that the default encoding scheme includes all the characters you use in your string, Python has no problem auto-coercing the string and printing it.

Example 9.17. Specifying encoding in .py files
If you are going to be storing non-ASCII strings within your Python code, you'll need to specify the encoding of each individual .py file by putting an encoding declaration at the top of each file. This declaration defines the .py file to be UTF-8:
#!/usr/bin/env python
# -*- coding: UTF-8 -*-

Now, what about XML?
Well, every XML document is in a specific encoding. Again, ISO-8859-1 is a popular encoding for data in Western European languages. KOI8-R is popular for Russian texts. The encoding, if specified, is in the header of the XML document.

Example 9.18. russiansample.xml
Предисловие

This is a sample extract from a real Russian XML document; it's part of a Russian translation of this very book. Note the encoding, koi8-r, specified in the header.

These are Cyrillic characters which, as far as I know, spell the Russian word for "Preface". If you open this file in a regular text editor, the characters will most likely look like gibberish, because they're encoded using the koi8-r encoding scheme, but they're being displayed in iso-8859-1.

Example 9.19. Parsing russiansample.xml
>>> from xml.dom import minidom
>>> xmldoc = minidom.parse('russiansample.xml')
>>> title = xmldoc.getElementsByTagName('title')[0].firstChild.data
>>> title
u'\u041f\u0440\u0435\u0434\u0438\u0441\u043b\u043e\u0432\u0438\u0435'
>>> print title
Traceback (innermost last):
File "", line 1, in ?
UnicodeError: ASCII encoding error: ordinal not in range(128)
>>> convertedtitle = title
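
The encode calls used in Examples 9.14 and 9.19 can be tried out in a small standalone script. This is only a sketch, not code from the book: it assumes an interpreter that accepts both u'' string literals and print with parentheses (Python 3.3 or later, for instance), and encode_or_none is a hypothetical helper name made up for illustration. It suggests that a unicode string converts cleanly only to an encoding that contains all of its characters, and that bytes read back with the wrong encoding come out as gibberish.

```python
# -*- coding: UTF-8 -*-
# Illustrative sketch only; encode_or_none is a hypothetical helper,
# not part of Python or of the book's examples.

def encode_or_none(text, encoding):
    """Return text encoded to bytes, or None if the encoding lacks a character."""
    try:
        return text.encode(encoding)
    except UnicodeEncodeError:
        return None

spanish = u'La Pe\xf1a'
russian = u'\u041f\u0440\u0435\u0434\u0438\u0441\u043b\u043e\u0432\u0438\u0435'

# latin-1 includes the tilde-n (241); 7-bit ASCII stops at 127.
latin1_bytes = encode_or_none(spanish, 'latin-1')
ascii_bytes = encode_or_none(spanish, 'ascii')

# koi8-r includes the Cyrillic letters; latin-1 does not.
koi8_bytes = encode_or_none(russian, 'koi8-r')
russian_as_latin1 = encode_or_none(russian, 'latin-1')

# Decoding with the matching encoding restores the original string.
roundtrip_ok = koi8_bytes.decode('koi8-r') == russian

# Decoding with the wrong encoding succeeds but yields different characters.
gibberish = koi8_bytes.decode('latin-1')

print(latin1_bytes == b'La Pe\xf1a')   # the tilde-n is stored as byte 0xf1
print(ascii_bytes is None)             # ASCII cannot represent the tilde-n
print(len(koi8_bytes))                 # one byte per Cyrillic letter
print(roundtrip_ok)
print(gibberish == russian)
```

The helper returns None instead of letting the error propagate purely to keep the comparison compact; in real code you would more likely catch the exception where you can decide on a fallback encoding.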
