
Processing HCUP ASC-format data with the Python library PyHCUP

Original link: https://github.com/jburke5/pyhcup

Outline

  • Environment setup
    • Python and Jupyter environment
    • conda virtual environment
  • About
  • Example Usage
    • Load a datafile/loadfile combination.
  • Sample program
  • Shortcut to loadfiles (meta data)
  • References


Translated by: season

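Part of the medical data collected in the United States is de-identified under HIPAA and made available to researchers at https://www.hcup-us.ahrq.gov/. However, the data is distributed in the uncommon ASC format (fixed-width text), so it has to be converted to CSV before big-data tools such as Spark can analyze it.

Fortunately, in 2015 someone wrote an open-source Python library that parses HCUP data with the help of the accompanying SAS load programs. In this post we explore how to use Python to parse and work with HCUP ASC data.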

Environment setup

Python and Jupyter environment

    # set the environment variable
    export PATH="/root/anaconda2/bin/:$PATH"
    source ~/.bashrc

    jupyter notebook --no-browser --port 8888 --ip=0.0.0.0 --allow-root

    jupyter notebook --generate-config

In your home directory (~ on Linux, or under C:\Users\Administrator on Windows), find the .jupyter folder and edit jupyter_notebook_config.py, uncommenting this line:

    # c.NotebookApp.notebook_dir = ''

conda virtual environment

    conda create -n iz_pyhcup --copy -y -q python=2.7 ipykernel pandas numpy

    source activate iz_pyhcup

    echo "y" | pip install PyHCUP

    echo "y" | pip install sqlalchemy

    source deactivate

About

PyHCUP is a Python library for parsing and importing data obtained from the United States Healthcare Cost and Utilization Project (http://hcup-us.ahrq.gov).

Data from HCUP come as a text file, with each column a specific width. However, the widths of these columns, and their names, are stored elsewhere. HCUP provides this meta data as either SAS or SPSS data loading programs.

PyHCUP is built to extract meta data from the SAS loading programs, then use that meta data to parse the actual data in the fixed-width text files. You’ll still need to acquire the actual data through HCUP.
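
To make the idea concrete, here is a minimal standard-library sketch of fixed-width parsing that does not use PyHCUP. The column widths below are invented for illustration; in real use the names and widths would come from the SAS load program, and PyHCUP's own implementation may well differ.

```python
# Hypothetical column layout as (name, width) pairs. In real use these
# would be extracted from the HCUP SAS load program, not hard-coded.
LAYOUT = [("AGE", 3), ("FEMALE", 2), ("KEY", 14)]


def parse_fixed_width(line, layout):
    """Split one fixed-width record into a {column: text} dict."""
    record = {}
    start = 0
    for name, width in layout:
        # Each field occupies exactly `width` characters; padding is stripped.
        record[name] = line[start:start + width].strip()
        start += width
    return record


# Build a right-aligned sample record matching the widths in LAYOUT.
sample_line = "%3s%2s%14s" % ("45", "1", "2009000001")
row = parse_fixed_width(sample_line, LAYOUT)
print(row)
```

Every value comes back as text here; converting columns to numeric types is a separate step that this sketch leaves out.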

A more verbose set of instructions is available in a series of posts on the author’s blog at http://bielism.blogspot.com/2013/12/hcup-and-python-pt-i-background.html.


Example Usage

Load a datafile/loadfile combination.

            
              
    import pyhcup

    # specify where your data and loadfiles live
    datafile = 'D:\\Users\\hcup\\sid\\NY_SID_2009_CORE.asc'
    loadfile = 'D:\\Users\\hcup\\sid\\sasload\\NY_SID_2009_CORE.sas'

    # pull basic meta from SAS loadfile
    meta_df = pyhcup.meta_from_sas(loadfile)

    # use meta knowledge to parse datafile into a pandas DataFrame
    df = pyhcup.read(datafile, meta_df)

    # that's it. use df from here.
              
            
          

Deal with very large files that cannot be held in memory in two ways.

  1. To import a subset of rows, such as for preliminary work or troubleshooting, specify nrows to read and/or skiprows to skip using sas.df_from_sas().
            
              
      # optionally specify nrows and/or skiprows to handle larger files
      df = pyhcup.read(datafile, meta_df, nrows=500000, skiprows=1000000)
              
            
          
  2. To iterate through chunks of rows, such as for importing into a database, first use the metadata to build lists of column names and widths. Next, pass a chunksize to the read() function above to create a generator yielding manageable-sized chunks.

      chunk_size = 500000
      reader = pyhcup.read(datafile, meta_df, chunksize=chunk_size)
      for df in reader:
          # do your business
          # such as replacing sentinel values (below)
          # or inserting into a database with another Python library
          pass
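
The chunked pattern itself does not depend on PyHCUP. As a rough illustration using only the standard library, the hypothetical helper `iter_chunks` below yields lists of at most `chunk_size` lines from any iterable, so only one chunk is held in memory at a time. It is a sketch of the idea, not a description of how `pyhcup.read()` is implemented.

```python
import io
from itertools import islice


def iter_chunks(lines, chunk_size):
    """Yield successive lists of up to `chunk_size` items from `lines`."""
    iterator = iter(lines)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return  # input exhausted
        yield chunk


# Stand-in for a large data file: seven one-line records.
fake_file = io.StringIO(u"".join(u"record %d\n" % i for i in range(7)))

chunk_lengths = [len(chunk) for chunk in iter_chunks(fake_file, 3)]
print(chunk_lengths)
```

The last chunk is shorter than the others whenever the record count is not a multiple of the chunk size, which any per-chunk processing needs to tolerate.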
              
            
          

Whether you are pulling in all records or just a chunk of records, you can also replace all those pesky missing/invalid data placeholders from HCUP (this is less useful for generically parsing missing values for non-HCUP files).

    # fyi, this bulldozes through all values in all columns with no per-column control
    replaced = pyhcup.replace_sentinels(df)
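
For a sense of what such a replacement involves, the sketch below maps placeholder codes to None in plain Python. The set of codes is purely illustrative and is an assumption, not the list PyHCUP uses; the HCUP documentation is the place to check which codes apply to each data element. The helper name `replace_placeholders` is likewise hypothetical.

```python
# Illustrative placeholder codes only (an assumption for this sketch);
# the real codes depend on the HCUP data element and its width.
ASSUMED_SENTINELS = {"-9", "-8", "-6", "-5"}


def replace_placeholders(record, sentinels=ASSUMED_SENTINELS):
    """Return a copy of `record` with sentinel values replaced by None."""
    return {
        column: (None if value in sentinels else value)
        for column, value in record.items()
    }


raw = {"AGE": "-9", "FEMALE": "1", "LOS": "3"}
cleaned = replace_placeholders(raw)
print(cleaned)
```

Like the call above, this treats every column the same way, so a legitimate value that happens to equal a placeholder code would also be blanked.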

            
          

Sample program

The section above gives two ways to load large data files (the raw files are usually very large, and loading one into pandas in a single pass will almost certainly fail): iterating over chunks, or jumping directly to a range of rows to analyze a subset. The sample below converts an ASC file from the HCUP dataset into a standard CSV file.

    import pyhcup

    #### convert NY_SASD_2016_CORE.asc to CSV
    filename = "NY_SASD_2016_CORE.asc"

    data_path = filename
    load_path = 'NY_SASD_2016_CORE.sas'

    # build a pandas DataFrame object from meta data
    meta_df = pyhcup.sas.meta_from_sas(load_path)

    chunk_size = 500000
    reader = pyhcup.read(data_path, meta_df, chunksize=chunk_size)

    index = 1
    for df in reader:
        if index == 1:
            # first chunk: drop the first two rows and create the file (with header)
            index = index + 1
            df[2:].to_csv('NY_SASD_2016_CORE.csv', index=False)
        else:
            # later chunks: append to the file without a header
            index = index + 1
            df.to_csv('NY_SASD_2016_CORE.csv', mode='a', header=False, index=False)
        print(index)
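
The header-once, append-afterwards pattern used above can be shown without PyHCUP or pandas. This sketch (written for Python 3) writes hypothetical in-memory chunks with the standard `csv` module; the helper name `write_chunks`, the column names, and the rows are all invented for illustration.

```python
import csv
import io


def write_chunks(chunks, fieldnames, out):
    """Write chunks of row dicts to `out`, emitting the header only once."""
    writer = csv.DictWriter(out, fieldnames=fieldnames, lineterminator="\n")
    for number, chunk in enumerate(chunks, start=1):
        if number == 1:
            writer.writeheader()  # only the first chunk writes the header row
        writer.writerows(chunk)   # every chunk contributes its data rows


# Invented sample data standing in for DataFrame chunks.
chunks = [
    [{"KEY": "1", "AGE": "45"}, {"KEY": "2", "AGE": "60"}],
    [{"KEY": "3", "AGE": "31"}],
]
buffer = io.StringIO()
write_chunks(chunks, ["KEY", "AGE"], buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

Writing the header a second time would leave a stray header line in the middle of the CSV, which is why the later chunks are appended with the header switched off.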
              
            
          

The two wrapper functions below export the ASC file for a given state and year (the file_name_for_status_And_Year argument, without extension) to a CSV file.

    ##################### batch write ####################################
    def write_hcupAsc_to_csv(file_name_for_status_And_Year):
        filename = file_name_for_status_And_Year + ".asc"
        load_path = file_name_for_status_And_Year + ".sas"
        save_name = file_name_for_status_And_Year + ".csv"

        meta_df = pyhcup.sas.meta_from_sas(load_path)

        chunk_size = 500000
        reader = pyhcup.read(filename, meta_df, chunksize=chunk_size)

        index = 1
        for df in reader:
            if index == 1:
                # first chunk: drop the first two rows and create the file (with header)
                index = index + 1
                df[2:].to_csv(save_name, index=False)
                print(type(df['KEY'].dtype))
            else:
                # later chunks: append to the file without a header
                index = index + 1
                df.to_csv(save_name, mode='a', header=False, index=False)
            print(index)

    ########################### test write: skip the first two lines, then write nrows rows ################################
    def write_Test_hcupAsc_to_csv(file_name_for_status_And_Year, save_name, nrows):
        filename = file_name_for_status_And_Year + ".asc"
        load_path = file_name_for_status_And_Year + ".sas"
        save_name = save_name + ".csv"

        meta_df = pyhcup.sas.meta_from_sas(load_path)

        df = pyhcup.read(filename, meta_df, nrows=nrows, skiprows=2)

        df.to_csv(save_name, index=False)
              
            
          

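Another way to read the file avoids the usual chunksize and instead computes the starting position for each read.

    # second approach: no chunksize
    filename = "NY_SID_2016_CORE.asc"
    load_path = 'NY_SID_2016_CORE.sas'
    save_name = 'NY_SID_2016_CORE.csv'

    # build a pandas DataFrame object from meta data
    meta_df = pyhcup.sas.meta_from_sas(load_path)

    # count the lines in the file without holding them all in memory
    with open(filename, "r") as f:
        length = sum(1 for line in f)
    print(length)

    chunk_size = 500000
    skip = 2  # leading lines to skip, as in the functions above
    # number of reads needed, rounded up so the final partial chunk is included
    step = (length - skip + chunk_size - 1) // chunk_size

    # first read: create the file and write the header
    df = pyhcup.read(filename, meta_df, nrows=chunk_size, skiprows=skip)
    df.to_csv(save_name, index=False)

    for i in range(1, step):
        # later reads: start where the previous one ended and append without a header
        df = pyhcup.read(filename, meta_df, nrows=chunk_size, skiprows=skip + i * chunk_size)
        df.to_csv(save_name, mode='a', header=False, index=False)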
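
The offset arithmetic in this approach is easy to get wrong at the boundaries, since the final chunk is usually shorter than the rest. The small sketch below uses a hypothetical helper, `chunk_offsets`, which is not part of PyHCUP, to compute the `skiprows` and `nrows` values for each read. It assumes, as the code above does, that a fixed number of leading lines is skipped.

```python
def chunk_offsets(total_lines, header_lines, chunk_size):
    """Return (skiprows, nrows) pairs that cover every data line once."""
    data_lines = max(total_lines - header_lines, 0)
    offsets = []
    for start in range(0, data_lines, chunk_size):
        # The last chunk may hold fewer than chunk_size rows.
        nrows = min(chunk_size, data_lines - start)
        offsets.append((header_lines + start, nrows))
    return offsets


# Example: 12 lines in total, 2 of them skipped, chunks of 4 rows.
pairs = chunk_offsets(total_lines=12, header_lines=2, chunk_size=4)
print(pairs)
```

Summing the `nrows` values should always give the number of data lines, which makes a quick sanity check before starting a long conversion.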
              
            
          

Shortcut to loadfiles (meta data)

The SAS loading program files provided by HCUP for the State Inpatient Database (SID), State Ambulatory Surgery Database (SASD), and State Emergency Department Database (SEDD) are bundled in this package for easy access. You can retrieve the meta data for these directly, without having to specify a loadfile path as described above.

Acquire meta in this way using the get_meta() function. You must pass a state abbreviation as the first argument and a year as the second argument, like so.

            
    meta_df = pyhcup.get_meta('NY', 2009)
              
            
          

By default, get_meta() acquires SID CORE data. Other meta can be acquired with the optional keyword arguments datafile (‘SID’, ‘SEDD’, or ‘SASD’) and category (‘CORE’, ‘CHGS’, ‘SEVERITY’, ‘DX_PR_GRPS’, or ‘AHAL’).

            
              
    # California emergency department charges meta for 2010
    ca_2010_emergency_charges_meta = pyhcup.get_meta('CA', 2010, datafile='SEDD', category='CHGS')

    # Arizona outpatient surgery DRG records meta for 2004
    az_2004_surg_groups_meta = pyhcup.get_meta('AZ', 2004, datafile='SASD', category='DX_PR_GRPS')

    # etc.
              
            
          

References

http://bielism.blogspot.com/2013/12/hcup-and-python-pt-5-nulls-and-pre.html

