亚洲免费在线-亚洲免费在线播放-亚洲免费在线观看-亚洲免费在线观看视频-亚洲免费在线看-亚洲免费在线视频

創造優秀的程序之必備知識:字符編碼(2)—軟件

系統 1742 0

軟件開發者必須知道的Unicode和字符編碼


這是一篇翻譯自Joel Spolsky的文章“The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets”,比較經典。

[翻譯時為增加可讀性,有少許改動]

原文:http://www.joelonsoftware.com/articles/Unicode.html


你曾經因為html文件里的Content-Type標簽感到迷惑嗎?比如:

    
      <
      
        meta
      
      
         http-equiv
      
      =
      
        "Content-Type" 
      
      
        content
      
      =
      
        "text/html; charset=utf-8" 
      
      
        
          /
        
      
      >
    
  
經常見到吧,卻不知道它是用來做什么的?

你有沒有收到過從外國發來的郵件(比如保加利亞),標題不能正常顯示,變成了 "???? ?????? ??? ????"?


令人沮喪的是,我發現很多軟件開發者不太了解這個神秘的世界:字符集,字符編碼,Unicode等類似的東西。2年前, FogBUGZ beta的一個測試人員想知道這個軟件能不能處理處理從日本發來的郵件,用日文寫的郵件。[Mark,不太清楚這里要表達的意思]。當查看那些我們用來解析MIME的電子郵件內容時,我發現它完全是用錯誤的方式來進行字符編碼的轉換,所以我只能寫一段代碼來復原那些錯誤的轉換并且從新以正確的方式操作字符編碼。

而當我再看到另一個商用的函數庫時,沒錯,關于字符編碼相關的實現也很糟糕。我聯系到了它的開發者,他說沒有什么解決辦法。就像很多程序員一樣,他也寄希望于那些問題會“自己消失”,真是莫名其妙。

讓問題“自己消失”,這是不可能的。When I discovered that the popular web development tool PHP has almost complete ignorance of character encoding issues , blithely using 8 bits for characters, making it darn near impossible to develop good international web applications, I thought, enough is enough .

所以我有一個消息要告訴你:如果你現在是一個程序員而且對計算機字符,字符集,編碼還有 Unicode的基本內容似乎并不了解。那我就正好逮到你了。我發誓要讓你蹲在潛艇里,罰你連續削6個月的圓蔥。

我還要告訴你:這些看似高深的東西,其實沒有那么難懂。

本文會告訴你每個程序員都應該了解的最基本內容。之前的一切關于“純文本==ASCII碼編碼的字符串==每個字符為一個字節的字符串”( "plain text = ascii = characters are 8 bits")不僅是錯誤的,而且是無可救藥地錯了。如果你現在還按照這樣的方式編寫程序,那就像一個醫生不相信細菌存在一樣。所以在讀完本文之前千萬不要去寫程序了。

在我們開始之前,我應該先提醒你,如果你是那些少數幾個了解國際化的人,你可能會覺得以下討論的內容有一點太簡單了。我試著將門檻降到最低,這樣所有人都能理解大致原理并且可以寫出能操縱任何語言的代碼,而不僅僅是英文字符串了。而且我也應該告訴你字符相關的技術只是創造國際化通用軟件的一小部分,因為我每次只能寫一樣東西,所以今天就寫了關于字符集、字符編碼的內容。

從歷史講起

理解它們最簡單的方式就是按照時間順序敘述它的發展。

但是我們現在不談EBCDIC,因為它實在太老了,那段歷史已經沒必要談了。

Back in the semi-olden days,Unix剛出現的時候,當K&R還在寫 The C Programming Language 的時候,一切都是那么簡單。EBCDIC也即將消失。那個時候計算機僅能表示那些不帶變音符號(注:歐洲部分語言在字母的頂端有變音符號)的英文字符,我們把它叫做 ASCII ,在ASCII碼表中,從編號32到127承載著所有的字符。空格是32號,字母“A”是65號,等等。因此這些字符能方便地存儲在7個2進制位里。那個時候大部分電腦都以8個2進制位為一個字節,所以一個字節里不僅能存儲所有的ASCII字符,還有1個2進制位的空位。如果你有點壞,可以把它用在其他邪惡的目的:曾經有個文字處理軟件WordStar填補了那個空位,以標定單詞的尾部。 編號32以下的稱為不可打印字符。它們用來讓你發牢騷。哈哈,只是個玩笑而已,其實它們被用作控制字符,比如7可以讓電腦蜂鳴,而12可以退出正在打印的頁面再裝入新的頁面。

ASCII table

假設你使用的語言是美國英語,一切看起來似乎都挺好。

由于一個字節里還剩下很多空間,很多人會想到“天哪,我們可以把128到255留給自己用”。問題是,很多人的想法類似,他們按照自己的意愿在128到255之間填充自己的內容。

The IBM-PC had something that came to be known as the OEM character set which provided some accented characters for European languages and a bunch of line drawing characters ... horizontal bars, vertical bars, horizontal bars with little dingle-dangles dangling off the right side, etc., and you could use these line drawing characters to make spiffy boxes and lines on the screen, which you can still see running on the 8088 computer at your dry cleaners'. In factas soon as people started buying PCs outside of America all kinds of different OEM character sets were dreamed up, which all used the top 128 characters for their own purposes. For example on some PCs the character code 130 would display as é, but on computers sold in Israel it was the Hebrew letter Gimel ( ? ), so when Americans would send their résumés to Israel they would arrive as <nobr>r<img alt="?" src="http://www.joelonsoftware.com/pictures/unicode/gimel.png" border="0" height="9" width="5" style="margin-right:2px">sum<img alt="?" src="http://www.joelonsoftware.com/pictures/unicode/gimel.png" border="0" height="9" width="5" style="margin-right:2px">s.</nobr> In many cases, such as Russian, there were lots of different ideas of what to do with the upper-128 characters, so you couldn't even reliably interchange Russian documents.

Eventually this OEM free-for-all got codified in the ANSI standard. In the ANSI standard, everybody agreed on what to do below 128, which was pretty much the same as ASCII, but there were lots of different ways to handle the characters from 128 and on up, depending on where you lived. These different systems were called code pages . So for example in Israel DOS used a code page called 862, while Greek users used 737. They were the same below 128 but different from 128 up, where all the funny letters resided. The national versions of MS-DOS had dozens of these code pages, handling everything from English to Icelandic and they even had a few "multilingual" code pages that could do Esperanto and Galician on the same computer! Wow! But getting, say, Hebrew and Greek on the same computer was a co 創造優秀的程序之必備知識:字符編碼(2)—軟件開發者必須知道的Unicode和字符編碼 mplete impossibility unless you wrote your own custom program that displayed everything using bitmapped graphics, because Hebrew and Greek required different code pages with different interpretations of the high numbers.

Meanwhile, in Asia, even more crazy things were going on to take into account the fact that Asian alphabets have thousands of letters, which were never going to fit into 8 bits. This was usually solved by the messy system called DBCS, the "double byte character set" in which some letters were stored in one byte and others took two. It was easy to move forward in a string, but dang near impossible to move backwards. Programmers were encouraged not to use s++ and s-- to move backwards and forwards, but instead to call functions such as Windows' AnsiNext and AnsiPrev which knew how to deal with the whole mess.

But still, most people just pretended that a byte was a character and a character was 8 bits and as long as you never moved a string from one computer to another, or spoke more than one language, it would sort of always work. But of course, as soon as the Internet happened, it became quite commonplace to move strings from one computer to another, and the whole mess came tumbling down. Luckily, Unicode had been invented.




創造優秀的程序之必備知識:字符編碼(2)—軟件開發者必須知道的Unicode和字符編碼


更多文章、技術交流、商務合作、聯系博主

微信掃碼或搜索:z360901061

微信掃一掃加我為好友

QQ號聯系: 360901061

您的支持是博主寫作最大的動力,如果您喜歡我的文章,感覺我的文章對您有幫助,請用微信掃描下面二維碼支持博主2元、5元、10元、20元等您想捐的金額吧,狠狠點擊下面給點支持吧,站長非常感激您!手機微信長按不能支付解決辦法:請將微信支付二維碼保存到相冊,切換到微信,然后點擊微信右上角掃一掃功能,選擇支付二維碼完成支付。

【本文對您有幫助就好】

您的支持是博主寫作最大的動力,如果您喜歡我的文章,感覺我的文章對您有幫助,請用微信掃描上面二維碼支持博主2元、5元、10元、自定義金額等您想捐的金額吧,站長會非常 感謝您的哦!!!

發表我的評論
最新評論 總共0條評論
主站蜘蛛池模板: 欧美一级毛片免费网站 | 中文字幕在线精品视频万部 | 中国特黄特级真人毛片 | 伊人一道本 | www.天天干 | 91精品视频在线看 | 中文字幕精品在线视频 | 不卡午夜视频 | 99热久久免费精品首页 | 四虎在线视频观看大全影视 | 91伦理视频 | 国产精品香蕉在线一区二区 | 人人干人人模 | 国产精品天干天干 | 久操这里只有精品 | 久久天天躁夜夜躁狠狠 | 正在播放亚洲一区 | 中文字幕国产视频 | 亚洲精品一区二区三区婷婷 | 亚洲最大在线视频 | 中文字幕免费在线播放 | 韩日精品在线 | 国产在线观看一区精品 | 国产福利在线看 | 羞羞视频网 | 亚洲欧洲尹人香蕉综合 | 亚洲人成一区二区三区 | 欧美性猛交ⅹxxx乱大交按摩 | 94欧美| 久久国产乱子伦精品免费强 | 久久视精品| 午夜国产精品影院在线观看 | 99久久国产亚洲综合精品 | 99热国产这里只有精品99 | 久久精品亚洲日本筱田优 | 亚洲涩涩精品专区 | 国产麻豆精品aⅴ免费观看 国产麻豆精品hdvideoss | 亚洲天天做日日做天天欢毛片 | 香蕉综合视频 | 九九热免费观看 | 日本一级毛片片在线播放 |