亚洲免费在线-亚洲免费在线播放-亚洲免费在线观看-亚洲免费在线观看视频-亚洲免费在线看-亚洲免费在线视频

創造優秀的程序之必備知識:字符編碼(2)—軟件

系統 1742 0

軟件開發者必須知道的Unicode和字符編碼


這是一篇翻譯自Joel Spolsky的文章“The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets”,比較經典。

[翻譯時為增加可讀性,有少許改動]

原文:http://www.joelonsoftware.com/articles/Unicode.html


你曾經因為html文件里的Content-Type標簽感到迷惑嗎?比如:

    
      <
      
        meta
      
      
         http-equiv
      
      =
      
        "Content-Type" 
      
      
        content
      
      =
      
        "text/html; charset=utf-8" 
      
      
        
          /
        
      
      >
    
  
經常見到吧,卻不知道它是用來做什么的?

你有沒有收到過從外國發來的郵件(比如保加利亞),標題不能正常顯示,變成了 "???? ?????? ??? ????"?


令人沮喪的是,我發現很多軟件開發者不太了解這個神秘的世界:字符集,字符編碼,Unicode等類似的東西。2年前, FogBUGZ beta的一個測試人員想知道這個軟件能不能處理處理從日本發來的郵件,用日文寫的郵件。[Mark,不太清楚這里要表達的意思]。當查看那些我們用來解析MIME的電子郵件內容時,我發現它完全是用錯誤的方式來進行字符編碼的轉換,所以我只能寫一段代碼來復原那些錯誤的轉換并且從新以正確的方式操作字符編碼。

而當我再看到另一個商用的函數庫時,沒錯,關于字符編碼相關的實現也很糟糕。我聯系到了它的開發者,他說沒有什么解決辦法。就像很多程序員一樣,他也寄希望于那些問題會“自己消失”,真是莫名其妙。

讓問題“自己消失”,這是不可能的。When I discovered that the popular web development tool PHP has almost complete ignorance of character encoding issues , blithely using 8 bits for characters, making it darn near impossible to develop good international web applications, I thought, enough is enough .

所以我有一個消息要告訴你:如果你現在是一個程序員而且對計算機字符,字符集,編碼還有 Unicode的基本內容似乎并不了解。那我就正好逮到你了。我發誓要讓你蹲在潛艇里,罰你連續削6個月的圓蔥。

我還要告訴你:這些看似高深的東西,其實沒有那么難懂。

本文會告訴你每個程序員都應該了解的最基本內容。之前的一切關于“純文本==ASCII碼編碼的字符串==每個字符為一個字節的字符串”( "plain text = ascii = characters are 8 bits")不僅是錯誤的,而且是無可救藥地錯了。如果你現在還按照這樣的方式編寫程序,那就像一個醫生不相信細菌存在一樣。所以在讀完本文之前千萬不要去寫程序了。

在我們開始之前,我應該先提醒你,如果你是那些少數幾個了解國際化的人,你可能會覺得以下討論的內容有一點太簡單了。我試著將門檻降到最低,這樣所有人都能理解大致原理并且可以寫出能操縱任何語言的代碼,而不僅僅是英文字符串了。而且我也應該告訴你字符相關的技術只是創造國際化通用軟件的一小部分,因為我每次只能寫一樣東西,所以今天就寫了關于字符集、字符編碼的內容。

從歷史講起

理解它們最簡單的方式就是按照時間順序敘述它的發展。

但是我們現在不談EBCDIC,因為它實在太老了,那段歷史已經沒必要談了。

Back in the semi-olden days,Unix剛出現的時候,當K&R還在寫 The C Programming Language 的時候,一切都是那么簡單。EBCDIC也即將消失。那個時候計算機僅能表示那些不帶變音符號(注:歐洲部分語言在字母的頂端有變音符號)的英文字符,我們把它叫做 ASCII ,在ASCII碼表中,從編號32到127承載著所有的字符。空格是32號,字母“A”是65號,等等。因此這些字符能方便地存儲在7個2進制位里。那個時候大部分電腦都以8個2進制位為一個字節,所以一個字節里不僅能存儲所有的ASCII字符,還有1個2進制位的空位。如果你有點壞,可以把它用在其他邪惡的目的:曾經有個文字處理軟件WordStar填補了那個空位,以標定單詞的尾部。 編號32以下的稱為不可打印字符。它們用來讓你發牢騷。哈哈,只是個玩笑而已,其實它們被用作控制字符,比如7可以讓電腦蜂鳴,而12可以退出正在打印的頁面再裝入新的頁面。

ASCII table

假設你使用的語言是美國英語,一切看起來似乎都挺好。

由于一個字節里還剩下很多空間,很多人會想到“天哪,我們可以把128到255留給自己用”。問題是,很多人的想法類似,他們按照自己的意愿在128到255之間填充自己的內容。

The IBM-PC had something that came to be known as the OEM character set which provided some accented characters for European languages and a bunch of line drawing characters ... horizontal bars, vertical bars, horizontal bars with little dingle-dangles dangling off the right side, etc., and you could use these line drawing characters to make spiffy boxes and lines on the screen, which you can still see running on the 8088 computer at your dry cleaners'. In factas soon as people started buying PCs outside of America all kinds of different OEM character sets were dreamed up, which all used the top 128 characters for their own purposes. For example on some PCs the character code 130 would display as é, but on computers sold in Israel it was the Hebrew letter Gimel ( ? ), so when Americans would send their résumés to Israel they would arrive as <nobr>r<img alt="?" src="http://www.joelonsoftware.com/pictures/unicode/gimel.png" border="0" height="9" width="5" style="margin-right:2px">sum<img alt="?" src="http://www.joelonsoftware.com/pictures/unicode/gimel.png" border="0" height="9" width="5" style="margin-right:2px">s.</nobr> In many cases, such as Russian, there were lots of different ideas of what to do with the upper-128 characters, so you couldn't even reliably interchange Russian documents.

Eventually this OEM free-for-all got codified in the ANSI standard. In the ANSI standard, everybody agreed on what to do below 128, which was pretty much the same as ASCII, but there were lots of different ways to handle the characters from 128 and on up, depending on where you lived. These different systems were called code pages . So for example in Israel DOS used a code page called 862, while Greek users used 737. They were the same below 128 but different from 128 up, where all the funny letters resided. The national versions of MS-DOS had dozens of these code pages, handling everything from English to Icelandic and they even had a few "multilingual" code pages that could do Esperanto and Galician on the same computer! Wow! But getting, say, Hebrew and Greek on the same computer was a co 創造優秀的程序之必備知識:字符編碼(2)—軟件開發者必須知道的Unicode和字符編碼 mplete impossibility unless you wrote your own custom program that displayed everything using bitmapped graphics, because Hebrew and Greek required different code pages with different interpretations of the high numbers.

Meanwhile, in Asia, even more crazy things were going on to take into account the fact that Asian alphabets have thousands of letters, which were never going to fit into 8 bits. This was usually solved by the messy system called DBCS, the "double byte character set" in which some letters were stored in one byte and others took two. It was easy to move forward in a string, but dang near impossible to move backwards. Programmers were encouraged not to use s++ and s-- to move backwards and forwards, but instead to call functions such as Windows' AnsiNext and AnsiPrev which knew how to deal with the whole mess.

But still, most people just pretended that a byte was a character and a character was 8 bits and as long as you never moved a string from one computer to another, or spoke more than one language, it would sort of always work. But of course, as soon as the Internet happened, it became quite commonplace to move strings from one computer to another, and the whole mess came tumbling down. Luckily, Unicode had been invented.




創造優秀的程序之必備知識:字符編碼(2)—軟件開發者必須知道的Unicode和字符編碼


更多文章、技術交流、商務合作、聯系博主

微信掃碼或搜索:z360901061

微信掃一掃加我為好友

QQ號聯系: 360901061

您的支持是博主寫作最大的動力,如果您喜歡我的文章,感覺我的文章對您有幫助,請用微信掃描下面二維碼支持博主2元、5元、10元、20元等您想捐的金額吧,狠狠點擊下面給點支持吧,站長非常感激您!手機微信長按不能支付解決辦法:請將微信支付二維碼保存到相冊,切換到微信,然后點擊微信右上角掃一掃功能,選擇支付二維碼完成支付。

【本文對您有幫助就好】

您的支持是博主寫作最大的動力,如果您喜歡我的文章,感覺我的文章對您有幫助,請用微信掃描上面二維碼支持博主2元、5元、10元、自定義金額等您想捐的金額吧,站長會非常 感謝您的哦!!!

發表我的評論
最新評論 總共0條評論
主站蜘蛛池模板: 深夜影院老司机69影院 | 香香影院在线观看 | 香蕉网在线播放 | 国产视频二 | 久久在线观看免费视频 | 99久久99久久久99精品齐 | 久久精品vr中文字幕 | 大尺度视频网站久久久久久久久 | 国产理论精品 | 国产真实偷人视频在线播放 | 久久天天躁狠狠躁夜夜2020一 | 九九99视频在线观看视频观看 | 亚洲精品99久久一区二区三区 | 免费费看的欧亚很色大片 | 模特视频一二三区 | 老司机永久免费网站在线观看 | 色www精品视频在线观看 | 99久久国产综合精麻豆 | 久久久婷婷亚洲5月97色 | 香蕉成人影院 | bbbb成人毛片免费看 | 四虎澳门永久8848在线影院 | 国产亚洲一区二区麻豆 | 久久精品99视频 | 亚洲欧美片 | 日本不卡视频在线播放 | 蜜月tv | 欧美一级精品高清在线观看 | 99re热视频在线 | 91成人精品 | 亚洲一区二区久久 | 久久99精品国产自在现线小黄鸭 | 欧美黄视频在线观看 | 在线欧美视频免费观看国产 | 年级的后妈妈2中文翻译 | 日韩欧美中文字幕一区 | 草莓视频caomei888 | 国产成人亚洲综合网站不卡 | 日韩亚洲一区中文字幕 | 99er久久| 老子影院午夜伦手机不卡无 |