The REF prefixes differ error in VCF merging #207

Closed
hsun3163 opened this issue Apr 1, 2022 · 2 comments

@hsun3163
Collaborator

hsun3163 commented Apr 1, 2022

While merging the vcf output from our sumstat merger, the following error occurs:

The REF prefixes differ: G vs A (1,1)
Failed to merge alleles at 1:896680 in /mnt/mfs/statgen/snuc_pseudo_bulk/eight_tissue_analysis/8test/ALL_Ast_End_Exc_Inh_Mic_OPC_Oli.1/ALL.log2cpm.bed.chr1.norminal.cis_long_table.vcf.gz

This is caused by:

1       896680  chr1:896680_A_G A       G       .       PASS    GENE=LINC01409  STAT:SE:P:TSS_D:AF:MA_SAMPLES:MA_COUNT -0.033299796:0.0425892:0.4348487535797959:117933:0.34819278:249:289
1       896680  chr1:896680_G_T G       T       .       PASS    GENE=LINC01409  STAT:SE:P:TSS_D:AF:MA_SAMPLES:MA_COUNT -0.12872803:0.12619513:0.3084487106650284:117933:0.024096385:20:20
1       896680  chr1:896680_A_G A       G       .       PASS    GENE=LINC01128  STAT:SE:P:TSS_D:AF:MA_SAMPLES:MA_COUNT -0.0072335657:0.040994305:0.8600473508955363:71542:0.34819278:249:289
1       896680  chr1:896680_G_T G       T       .       PASS    GENE=LINC01128  STAT:SE:P:TSS_D:AF:MA_SAMPLES:MA_COUNT -0.019534744:0.121549845:0.8724180132737893:71542:0.024096385:20:20

where two SNPs at the same position, chr1:896680_A_G and chr1:896680_G_T, have different REF alleles (A vs. G).
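
To check how many positions are affected before merging, the sites that carry more than one distinct REF allele can be listed. This is a minimal sketch, assuming uncompressed, tab-separated VCF text on stdin; `find_multi_ref_sites` is a hypothetical helper name and not part of the pipeline.

```shell
# Hypothetical helper: list positions that carry more than one distinct REF allele.
# stdin: uncompressed, tab-separated VCF text. stdout: "CHROM:POS<TAB>REF1,REF2,...".
find_multi_ref_sites() {
  awk -F'\t' '
    /^#/ { next }                      # skip header lines
    {
      site = $1 ":" $2                 # CHROM:POS
      key = site SUBSEP $4             # CHROM:POS plus REF
      if (!(key in seen)) {            # first time this REF appears at this site
        seen[key] = 1
        n[site]++
        refs[site] = (n[site] > 1) ? (refs[site] "," $4) : $4
      }
    }
    END {
      for (s in n) if (n[s] > 1) print s "\t" refs[s]
    }
  ' | sort
}
```

It could be run as `zcat ALL.log2cpm.bed.chr1.norminal.cis_long_table.vcf.gz | find_multi_ref_sites`; for the four records shown above it would report position 1:896680 with REFs A,G.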

This can potentially be fixed with the +fixref plugin from bcftools, but it is unclear what will happen to our data in the FORMAT field.
Alternatively, this issue forces us to produce a high-quality TARGET file with only one REF for each position, and to use that as our template.

The sumstat merger is otherwise error-free.

@hsun3163
Collaborator Author

hsun3163 commented Apr 1, 2022

Fortunately, at the moment, all instances of multiple REFs follow the pattern shown above, i.e. at least one bp is shared between the SNPs.

This issue may come from the following setup in our vcf_qc module.

# when incorrect or missing REF allele is encountered: warn (w), no left normalization is done.
    bcftools norm -d exact -N --check-ref w -f ${reference_genome}  -Oz --threads ${numThreads} |\

The problem should disappear once we have a good TARGET file that has only one REF per position.


Two things need to be done:

  • Fix vcf_qc to make this issue disappear: for future users.
    This should be simple enough based on the following option description; test pending.

-c, --check-ref e|w|x|s
what to do when incorrect or missing REF allele is encountered: exit (e), warn (w), exclude (x), or set/fix (s) bad sites. The w option can be combined with x and s. Note that s can swap alleles and will update genotypes (GT) and AC counts, but will not attempt to fix PL or other fields. Also note, and this cannot be stressed enough, that s will NOT fix strand issues in your VCF, do NOT use it for that purpose!!! (Instead see http://samtools.github.io/bcftools/howtos/plugin.af-dist.html and <http://samtools.github.io/bcftools/howtos/plugin.fixref.html>.)

  • Create the correct sumstat reference: for us, because we don't want to spend a couple of days redoing all the genotype processing.

@hsun3163
Collaborator Author

hsun3163 commented Apr 6, 2022

After introducing a standalone TARGET file, this problem is fixed.
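
The thread does not record how the standalone TARGET file was built or applied. As a rough illustration of the template idea only, the sketch below keeps the records whose REF matches a one-REF-per-position table. The three-column TARGET layout (CHROM, POS, REF, tab-separated) and the name `filter_by_target` are assumptions made for this example, and dropping records at positions missing from the table is a design choice of the sketch, not something stated in the issue.

```shell
# Hypothetical helper: keep header lines plus records whose REF matches the template.
# $1: TARGET table, tab-separated CHROM, POS, REF (assumed layout, one REF per position).
# stdin: uncompressed, tab-separated VCF text. stdout: filtered VCF text.
filter_by_target() {
  awk -F'\t' -v target="$1" '
    BEGIN {
      # load the template: CHROM:POS -> REF
      while ((getline line < target) > 0) {
        split(line, f, "\t")
        ref[f[1] ":" f[2]] = f[3]
      }
      close(target)
    }
    /^#/ { print; next }               # pass VCF header lines through unchanged
    {
      key = $1 ":" $2
      # keep the record only if the template knows this site and agrees on REF
      if ((key in ref) && ref[key] == $4) print
    }
  '
}
```

With a TARGET row of `1 896680 A`, the chr1:896680_A_G records would be kept and the chr1:896680_G_T records dropped, so only one REF remains at that position.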

@hsun3163 hsun3163 closed this as completed Apr 6, 2022