PyPheWAS Inputs and Outputs
This document describes all of the input flags for the pyphewas command as well as the different columns in the output file.
PheWAS Inputs:
Required Inputs
--counts: Filepath to a comma-separated file where each row has an ID, a phecode ID, and the number of times that individual has that phecode in their medical record. The file should only have 3 columns and an error will be raised if it has any other number of columns.

| person_id | phecode_id | count |
|---|---|---|
| 1001 | 250.2 | 5 |
| 1002 | 401.1 | 12 |
| 1003 | 296.22 | 1 |
--covariate-file: Filepath to a comma-separated file that lists the covariates and predictor for each individual. The column containing sample IDs should be named the same as the corresponding column in the counts file. The individuals listed in the covariate file will be the individuals in the cohort. Note that if the ‘--flip-predictor-and-outcome’ flag is used, the predictor variable is treated as the outcome in the model.
--covariate-list: Space-separated list of covariates to use in the model. All of these covariates must be present in the covariate file and must be spelled exactly the same; otherwise, the code will crash.
--phecode-version: String indicating which version of phecodes to use. This argument helps with mapping the PheCode ID to a description. The allowed values are “phecodeX”, “phecode1.2”, “phecodeX_who”, or “None”. Most users will only need either the “phecodeX” or the “phecode1.2” option.
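The requirements on the two input files can be checked before starting a run. The following is a minimal sketch, not part of pyphewas itself: `check_inputs` is a hypothetical helper, it assumes both files have a header row, and it inspects only the headers, not the individual rows. The defaults for the sample and status column names are taken from the --sample-col and --status-col options.

```python
import csv


def check_inputs(counts_path, covariate_path, covariates,
                 sample_col="person_id", status_col="status"):
    """Return a list of problems found in the two input files (empty if none).

    Hypothetical pre-flight helper; only the header rows are inspected.
    """
    problems = []

    # The counts file must have exactly three columns.
    with open(counts_path, newline="") as handle:
        counts_header = next(csv.reader(handle), [])
    if len(counts_header) != 3:
        problems.append(
            f"counts file has {len(counts_header)} columns, expected 3"
        )

    # The sample ID column must carry the same name in both files.
    if sample_col not in counts_header:
        problems.append(f"counts file is missing column '{sample_col}'")

    # The covariate file needs the sample ID column, the status column,
    # and every covariate that will be passed to --covariate-list.
    with open(covariate_path, newline="") as handle:
        covariate_header = next(csv.reader(handle), [])
    for name in [sample_col, status_col, *covariates]:
        if name not in covariate_header:
            problems.append(f"covariate file is missing column '{name}'")

    return problems
```

An empty list means the headers are consistent with the requirements above; covariate names are compared exactly, so a difference in spelling or case is reported as a missing column.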
Optional Inputs
--min-phecode-count: Minimum number of times a phecode must appear in an individual’s record for that individual to be considered a case for the phecode. Default value is 2. Under default settings, individuals with exactly 1 occurrence of a phecode are excluded from the regression for that phecode. If this value is set to 1, no individuals are excluded.
--min-case-count: Minimum number of cases a phecode must have to be included in the analysis. The default value is 20. This value reflects convention rather than rigorous testing. For more rigorous results, a more conservative value of 100 may be preferable.
--status-col: Name of the column in the covariate file that holds the predictor’s case/control status. The covariate file should be comma-separated. Default value is “status”.
--sample-col: Name of the column in the covariate file that holds the individual IDs. Default value is “person_id”.
--output: Filename to write the output to. The output will be written as a tab-separated file. If the filename ends in .gz, the file will be gzip-compressed; otherwise, it will be uncompressed. Default value is test_output.txt.
--phecode-descriptions: Filepath to a comma-separated file that lists the phecode ID and the corresponding phecode name. Default description files are stored in the ‘./src/phecode_maps/’ folder and can serve as examples of the files currently used in the code. The phecode ID is expected to be the first column, while the phecode description is expected to be the fourth column.
--model: Type of regression model to use for the analysis. The two options are ‘logistic’ and ‘linear’. Default option is logistic.
--cpus: Number of CPUs to use during the analysis. Default value is 1.
--max-iterations: Maximum number of iterations the regression may use to converge. If the model doesn’t converge after reaching this threshold, a ConvergenceWarning will be issued. If you find that many phecodes are not converging, increase this value so that more of them can converge. Default value is 200.
--flip-predictor-and-outcome: Depending on the analysis, you may want the status column in the covariate file to be a predictor or to be the outcome. If you want the status to be the outcome, supply this flag as ‘--flip-predictor-and-outcome’. When the status is the outcome, the case/control status for the individual phecodes becomes the predictor.
--run-sex-specific: Depending on the analysis, you may also want to restrict the analysis to a sex-stratified cohort. This flag is one of three flags that have to be used in tandem to stratify the analysis. Allowed values are ‘male-only’ and ‘female-only’.
--male-as-one: If the ‘--run-sex-specific’ flag is used, then this flag also has to be passed to indicate whether males were coded as 1 and females as 0, or vice versa. Pass this flag as ‘--male-as-one’ to indicate that males were coded as 1. The default value is True, although this flag is ignored if the ‘--run-sex-specific’ flag is not provided.
--sex-col: Name of the column in the covariate file containing Sex or Gender information. This flag is required if the ‘--run-sex-specific’ flag is used.
--verbose (or -v): Verbose flag indicating if the user wants more information. This is a counting flag, so adding more ‘v’s (e.g., -vv) will increase the verbosity level.
--log-to-console: Optional flag to log output to the console in addition to the log file.
--log-filename: Name for the log output file. Default value is pyphewas.log.
--version: Show the program’s version number and exit.
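The way the --min-phecode-count and --min-case-count thresholds interact can be illustrated with a short sketch. This is not the pyphewas implementation: `classify` and `phecodes_to_test` are hypothetical names, phecode IDs are assumed to be strings, and controls are assumed to be individuals with zero recorded occurrences of the phecode.

```python
from collections import defaultdict


def classify(count, min_phecode_count=2):
    """Label one individual's status for one phecode.

    Assumption: an individual with zero occurrences is a control.
    """
    if count >= min_phecode_count:
        return "case"
    if count == 0:
        return "control"
    # 0 < count < min_phecode_count: neither case nor control.
    return "excluded"


def phecodes_to_test(rows, min_phecode_count=2, min_case_count=20):
    """Return the phecodes that have enough cases to be analysed.

    rows: iterable of (person_id, phecode_id, count) tuples,
    mirroring the three columns of the counts file.
    """
    cases = defaultdict(int)
    for _person_id, phecode_id, count in rows:
        if classify(count, min_phecode_count) == "case":
            cases[phecode_id] += 1
    return sorted(p for p, n in cases.items() if n >= min_case_count)
```

With the defaults, an individual with a single occurrence is labelled "excluded"; with `min_phecode_count=1` the same individual is labelled "case", so the "excluded" group is empty, which matches the behaviour described for --min-phecode-count.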
PheWAS Output:
This command outputs a text file that has the results for each phecode run in the analysis. This file will be gzip-compressed if the file has a “.gz” suffix. The columns of this file are described below:
phecode: PheCode ID from the PheWAS catalogue (Ex: “499”).
phecode_description: Text description of the phecode (Ex: “Cystic Fibrosis” for PheCode 499).
phecode_category: The “class” that this particular phecode belongs to. This value is used to group the individual phecodes in the Manhattan plot (Ex: “Endocrine/Metab”).
case_count: Number of participants who were classified as cases for the particular phecode.
control_count: Number of individuals who were classified as controls for the particular phecode.
converged: Whether or not the regression model converged for this specific phecode.
The next three columns are the output statistics for the model. Three columns are made for every term in your model, so a model with three independent variables produces 9 additional output columns.
*pvalue: P-value calculated by the regression model.
*beta: Beta estimates from the regression model.
*stderr: Standard error for the betas.
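The output file can be loaded with the Python standard library. The sketch below is one possible reader, not part of pyphewas: `read_results` and `statistic_columns` are hypothetical helpers, and the second one assumes that the per-term statistic columns end in "pvalue", "beta", or "stderr", which is an interpretation of the column names listed above rather than a documented naming scheme.

```python
import csv
import gzip

# Assumed suffixes of the per-term statistic columns.
STAT_SUFFIXES = ("pvalue", "beta", "stderr")


def read_results(path):
    """Read a tab-separated results file into a list of dicts.

    Files whose name ends in .gz are opened with gzip, matching
    the behaviour described for the --output option.
    """
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", newline="") as handle:
        return list(csv.DictReader(handle, delimiter="\t"))


def statistic_columns(rows):
    """Return the column names that end in a statistic suffix."""
    if not rows:
        return []
    return [name for name in rows[0] if name.endswith(STAT_SUFFIXES)]
```

All values are returned as strings, so numeric fields such as case_count or the p-values need to be converted before filtering or sorting. If the suffix assumption holds, `len(statistic_columns(rows))` should equal three times the number of terms in the model.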