A database for efficient storage and management of multi panel SNP data

Groeneveld, E.; Truong, C. V. C.

doi:https://doi.org/10.7482/0003-9438-56-103

Articles | Volume 56, issue 1

© Author(s) 2013. This work is distributed under
the Creative Commons Attribution 3.0 License.

20 Nov 2013

Abstract. The rapid development of high-throughput genotyping has opened up new possibilities in genetics while creating substantial data handling problems. A system design and a proof-of-concept implementation are presented that provide efficient storage and manipulation of single nucleotide polymorphism (SNP) genotypes in a relational database. A new strategy based on SNP and individual selection vectors allows us to view SNP data as matrices or sets. These genotype sets provide an easy way to handle both original and derived data, the latter at essentially no storage cost. Because genotypes are stored as vectors in the database, data imports and exports are much faster than those of other SNP databases. In the proof-of-concept implementation, the compressed storage scheme reduces disk space requirements by a factor of around 300. Furthermore, the design scales linearly with the number of individuals and SNPs involved. The procedure supports panels of different sizes, which allows straightforward management of different panel sizes within the same population, as occurs in animal breeding programs when higher-density panels replace earlier lower-density versions.