See Also : Intra African Genome-Wide Analysis, V1
Finally got some more badly needed genome-wide data from East Africa. 12 sets of populations were added: 9 Afroasiatic (3 Omotic, 4 Cushitic, 2 Semitic) and 3 Nilo-Saharan.
I updated my Africa reference map and table below; the newer populations are indexed 46-57.
In addition, the data were merged with the older dataset. The bad news is that the genotyping rate across all 26,129 SNPs dropped by about 7% to 92.4%. The good news, of course, is that the data I was eagerly anticipating, especially the Nilotic samples from South Sudan and the Omotic samples from Ethiopia, are now available.
When I reran the model-based analysis with the same settings, i.e. ADMIXTURE at K=10, the major shifts in cluster allocation were that the Mbuti and Biaka Pygmy clusters merged into a single Pygmy cluster and the West-Central African cluster disappeared; in their place, a Nilotic and an Omotic cluster formed. The ADMIXTURE proportions shifted substantially for all populations except those from South Africa, and so did the Fst distances, where the previous major East African cluster (East Africa 2) moved much closer to the North African cluster:
This is also seen in the ADMIXTURE proportions, where the East African proportion in North Africans is significantly higher. I will look to update this post with more analysis, but for now:
UPDATE:
Had a chance to rerun the exact same intra-African dataset as above, but this time for K=2-10, while also checking the cross-validation (CV) error values:
K, CV Error
1 0.58753
2 0.56519
3 0.55874
4 0.55554
5 0.55379
6 0.55315
7 0.55269
8 0.55239
9 0.55215
10 0.55201
As can be seen, the CV error is still decreasing at K=10, meaning I still have some room to go beyond K=10 in my K selection for this dataset.
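For anyone who wants to check the trend themselves, here is a minimal Python sketch that takes the CV errors from the table above and prints how much the error falls at each step. The function names (`cv_drops`, `still_decreasing`) are just illustrative helpers of mine, not part of ADMIXTURE.

```python
# CV errors copied from the table above (K -> CV error).
cv_error = {
    1: 0.58753, 2: 0.56519, 3: 0.55874, 4: 0.55554, 5: 0.55379,
    6: 0.55315, 7: 0.55269, 8: 0.55239, 9: 0.55215, 10: 0.55201,
}

def cv_drops(cv):
    """Return {K: cv[K-1] - cv[K]} for each pair of consecutive K values."""
    ks = sorted(cv)
    return {k: round(cv[prev] - cv[k], 5) for prev, k in zip(ks, ks[1:])}

def still_decreasing(cv):
    """True if the CV error falls at every step up to the largest K tested."""
    return all(d > 0 for d in cv_drops(cv).values())

drops = cv_drops(cv_error)
best_k = min(cv_error, key=cv_error.get)  # K with the lowest CV error so far

for k, d in drops.items():
    print(f"K={k}: CV error fell by {d:.5f}")
print("lowest CV error at K =", best_k)
print("still decreasing:", still_decreasing(cv_error))
```

The drops get smaller with each step, so the gain from each additional cluster is shrinking even though the minimum has not yet been reached.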
I have uploaded the full set of results and processed output (mean, median, standard deviation) here for anybody who may be interested. Since I do not have time to plot each K's results as I did for K10 earlier, I will instead post the peaking populations for each cluster at each K, as reported by my program, along with the median values for 3 selected populations: EtA-P (26), ARI-B (17) and South-Sudan (24):
K2- East and North Africans split from other Africans
Cluster1: morocco-n,egypt,egyptans,libya,algeria
Cluster2: pygmy,mbutipygmy,biakapygmy,kongo,yoruba
EtA-P
Cluster1 71.94% morocco-n
Cluster2 28.06% pygmy
ARI-B
Cluster2 61.06% pygmy
Cluster1 38.94% morocco-n
South-Sudan
Cluster2 88.06% pygmy
Cluster1 11.94% morocco-n
K3- West Africans and also Nilotes split from San/Pygmy (Hunter Gatherers)
Cluster1: morocco-n,egypt,egyptans,libya,algeria
Cluster2: yoruba,dogon,bambaran,igbo,brong
Cluster3: san-nb,san,pygmy,mbutipygmy,!kung
EtA-P
Cluster1 71.22% morocco-n
Cluster2 20.82% yoruba
Cluster3 8.12% san-nb
ARI-B
Cluster1 38.60% morocco-n
Cluster2 36.05% yoruba
Cluster3 24.05% san-nb
South-Sudan
Cluster2 81.62% yoruba
Cluster1 10.20% morocco-n
Cluster3 8.04% san-nb
K4- Nilotes and Omotic Split off
Cluster1: Gumuz,ARI-B,ARI-C,Anuak,South-Sudan
Cluster2: morocco-n,egypt,egyptans,libya,algeria
Cluster3: san-nb,san,pygmy,mbutipygmy,!kung
Cluster4: yoruba,dogon,brong,igbo,bambaran
EtA-P
Cluster2 55.77% morocco-n
Cluster1 42.46% Gumuz
Cluster3 1.39% san-nb
ARI-B
Cluster1 70.01% Gumuz
Cluster2 18.77% morocco-n
Cluster3 10.01% san-nb
South-Sudan
Cluster1 63.62% Gumuz
Cluster4 35.89% yoruba
K5- Pygmies Split off
Cluster1: pygmy,mbutipygmy,biakapygmy,alur,fang
Cluster2: san-nb,san,!kung,xhosa,sotho/tswana
Cluster3: Gumuz,ARI-B,ARI-C,Anuak,South-Sudan
Cluster4: yoruba,dogon,brong,igbo,bambaran
Cluster5: morocco-n,egypt,libya,egyptans,algeria
EtA-P
Cluster5 55.31% morocco-n
Cluster3 41.99% Gumuz
Cluster2 2.10% san-nb
ARI-B
Cluster3 69.58% Gumuz
Cluster5 18.18% morocco-n
Cluster2 9.93% san-nb
South-Sudan
Cluster3 62.81% Gumuz
Cluster4 35.46% yoruba
Cluster1 2.40% pygmy
K6- Hadza Split off
Cluster1: Gumuz,ARI-B,Anuak,South-Sudan,ARI-C
Cluster2: yoruba,dogon,brong,igbo,bambaran
Cluster3: morocco-n,egypt,libya,egyptans,algeria
Cluster4: hadza,ARI-B,sandawe,ARI-C,Gumuz
Cluster5: pygmy,mbutipygmy,biakapygmy,alur,fang
Cluster6: san-nb,san,!kung,xhosa,sotho/tswana
EtA-P
Cluster3 54.93% morocco-n
Cluster1 41.19% Gumuz
Cluster6 1.94% san-nb
Cluster4 1.77% hadza
ARI-B
Cluster1 63.88% Gumuz
Cluster3 17.74% morocco-n
Cluster6 8.12% san-nb
Cluster4 8.10% hadza
South-Sudan
Cluster1 63.97% Gumuz
Cluster2 33.43% yoruba
Cluster5 3.00% pygmy
K7- Omotic Cluster forms
Cluster1: morocco-n,egypt,libya,egyptans,algeria
Cluster2: South-Sudan,Anuak,Gumuz,maasai,bulala
Cluster3: hadza,sandawe,Gumuz,ARI-C,maasai
Cluster4: san-nb,san,!kung,xhosa,sotho/tswana
Cluster5: pygmy,mbutipygmy,biakapygmy,alur,fang
Cluster6: yoruba,igbo,brong,dogon,bambaran
Cluster7: ARI-B,ARI-C,Wolayta,Gumuz,EtO-P
EtA-P
Cluster1 51.56% morocco-n
Cluster2 27.86% South-Sudan
Cluster7 18.49% ARI-B
ARI-B
Cluster7 98.05% ARI-B
South-Sudan
Cluster2 69.00% South-Sudan
Cluster6 22.88% yoruba
Cluster5 3.72% pygmy
Cluster7 3.51% ARI-B
Cluster3 1.10% hadza
K8- Eastern Bantu Cluster forms
Cluster1: ARI-B,ARI-C,Wolayta,sandawe,Gumuz
Cluster2: South-Sudan,Anuak,Gumuz,maasai,bulala
Cluster3: dogon,mandenka,bambaran,brong,yoruba
Cluster4: pygmy,mbutipygmy,biakapygmy,alur,fang
Cluster5: morocco-n,egypt,libya,egyptans,mozabite
Cluster6: luhya,biakapygmy,bantukenya,nguni,pedi
Cluster7: san-nb,san,!kung,xhosa,sotho/tswana
Cluster8: hadza,sandawe,Gumuz,ARI-C,maasai
EtA-P
Cluster5 49.89% morocco-n
Cluster2 25.81% South-Sudan
Cluster1 21.96% ARI-B
ARI-B
Cluster1 96.85% ARI-B
South-Sudan
Cluster2 69.36% South-Sudan
Cluster3 22.68% dogon
Cluster4 4.75% pygmy
Cluster8 1.14% hadza
K9- East Africa2 cluster forms
Cluster1: san-nb,san,!kung,xhosa,sotho/tswana
Cluster2: Somali,EtS-P,maasai,Afar,EtO
Cluster3: pygmy,mbutipygmy,biakapygmy,alur,fang
Cluster4: South-Sudan,Anuak,Gumuz,bulala,alur
Cluster5: hadza,sandawe,ARI-C,Gumuz,maasai
Cluster6: dogon,mandenka,bambaran,brong,yoruba
Cluster7: ARI-B,ARI-C,Wolayta,sandawe,EtO-P
Cluster8: biakapygmy,luhya,bantukenya,pedi,nguni
Cluster9: morocco-n,mozabite,egypt,libya,algeria
EtA-P
Cluster9 40.82% morocco-n
Cluster2 29.92% Somali
Cluster7 24.72% ARI-B
Cluster4 3.17% South-Sudan
ARI-B
Cluster7 59.24% ARI-B
Cluster2 16.46% Somali
Cluster4 8.44% South-Sudan
Cluster9 6.70% morocco-n
Cluster1 4.07% san-nb
Cluster5 3.16% hadza
Cluster3 2.15% pygmy
South-Sudan
Cluster4 81.33% South-Sudan
Cluster6 8.64% dogon
Cluster3 2.82% pygmy
Cluster8 1.65% biakapygmy
K10- East Africa 1 cluster forms
Cluster1: hadza,sandawe,ARI-C,Gumuz,EtO
Cluster2: dogon,mandenka,bambaran,brong,yoruba
Cluster3: Somali,EtS-P,Afar,EtT-P,EtA-P
Cluster4: maasai,sandawe,EtO,hema,ethiopian-jews
Cluster5: pygmy,mbutipygmy,biakapygmy,alur,fang
Cluster6: luhya,bantukenya,nguni,pedi,bantusouthafrica
Cluster7: ARI-B,ARI-C,biakapygmy,Wolayta,Gumuz
Cluster8: san-nb,san,!kung,xhosa,sotho/tswana
Cluster9: South-Sudan,Anuak,Gumuz,bulala,alur
Cluster10: mozabite,morocco-n,sahara-occ,algeria,moroccans
EtA-P
Cluster3 52.46% Somali
Cluster10 19.53% mozabite
Cluster9 13.35% South-Sudan
Cluster7 5.88% ARI-B
Cluster4 3.10% maasai
Cluster1 1.80% hadza
ARI-B
Cluster7 94.69% ARI-B
Cluster3 1.32% Somali
South-Sudan
Cluster9 79.74% South-Sudan
Cluster2 10.27% dogon
Cluster6 4.45% luhya
Cluster5 2.83% pygmy
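The breakdowns above (top five peaking populations per cluster, plus per-population medians) could be produced along the following lines. This is only a sketch under assumptions: it assumes one row of cluster proportions per individual with a population label attached, and it ranks populations by their median proportion, which may not be exactly what my program does. The toy numbers are made up for illustration and are not real results.

```python
from collections import defaultdict
from statistics import median

def population_medians(samples):
    """samples: iterable of (population, [q1, ..., qK]) rows, one per individual.
    Returns {population: [median of each cluster proportion]}."""
    by_pop = defaultdict(list)
    for pop, q in samples:
        by_pop[pop].append(q)
    return {pop: [median(col) for col in zip(*rows)] for pop, rows in by_pop.items()}

def peaking_populations(medians, top=5):
    """For each cluster, list the `top` populations with the highest median proportion."""
    k = len(next(iter(medians.values())))
    return [
        sorted(medians, key=lambda pop: medians[pop][c], reverse=True)[:top]
        for c in range(k)
    ]

# Toy, made-up K=2 proportions purely for illustration (not the real data).
toy = [
    ("popA", [0.90, 0.10]), ("popA", [0.80, 0.20]), ("popA", [0.85, 0.15]),
    ("popB", [0.30, 0.70]), ("popB", [0.20, 0.80]),
    ("popC", [0.55, 0.45]),
]
meds = population_medians(toy)
peaks = peaking_populations(meds, top=2)

for c, pops in enumerate(peaks, start=1):
    print(f"Cluster{c}: {','.join(pops)}")
```

Since medians are taken cluster by cluster, the values reported for a population need not sum to exactly 100%.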
UPDATE2: MDS plots based on average coordinates of populations:
[Figures: Isometric view, C1 vs C2, C1 vs C3]
UPDATE3: Removing Outliers.
Based on the previous K10 ADMIXTURE run, I used a statistical outlier-removal method to extract the more homogeneous samples from the N=1300 dataset. The method removes samples with a studentization value > 2. For each sample, it subtracts the population's mean proportion for a given cluster from the sample's proportion for that cluster, then divides this difference (the residual) by the population's standard deviation for that cluster to arrive at the studentization value for the sample.
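A rough Python sketch of this filter is given below. A few details are my assumptions for the sake of the example rather than a description of the exact procedure: it uses the absolute value of the studentization value, it uses the sample standard deviation, it drops a sample if any one cluster exceeds the threshold, and it leaves single-sample populations alone. The toy data are made up.

```python
from collections import defaultdict
from statistics import mean, stdev

def flag_outliers(samples, threshold=2.0):
    """samples: list of (sample_id, population, [q1, ..., qK]).
    Returns the set of sample_ids whose |q - population mean| / population SD
    exceeds `threshold` for at least one cluster (assumed rule, see text)."""
    by_pop = defaultdict(list)
    for sid, pop, q in samples:
        by_pop[pop].append((sid, q))

    flagged = set()
    for rows in by_pop.values():
        if len(rows) < 2:
            continue  # SD is undefined for a single sample; keep it
        columns = zip(*(q for _, q in rows))  # one column per cluster
        for c, col in enumerate(columns):
            m, s = mean(col), stdev(col)
            if s == 0:
                continue  # no variation in this cluster, nothing to flag
            for sid, q in rows:
                if abs(q[c] - m) / s > threshold:
                    flagged.add(sid)
    return flagged

# Toy, made-up data: nine similar samples and one clear outlier.
toy = [(f"s{i}", "popA", [0.5, 0.5]) for i in range(9)]
toy.append(("s9", "popA", [0.9, 0.1]))

flagged = flag_outliers(toy)
kept = [row for row in toy if row[0] not in flagged]
print("removed:", sorted(flagged), "| kept:", len(kept))
```

Note that with the sample standard deviation, a population needs at least six samples before any single sample can exceed a value of 2, so very small populations are effectively never trimmed by this rule.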
Applying this method removed 392 individuals from across the dataset. In addition, the Pagani publication had identified 13 samples that were potentially related, using a PLINK identity-by-descent score of >= 0.125; I removed those individuals as well. This left me with a new outlier-removed dataset of 895 samples.
Rerunning ADMIXTURE at K=10 on this new dataset reintroduced the Biaka Pygmy cluster that had appeared in my Version 1 run, before the addition of the new East African samples from Pagani et al. The reappearance of the Biaka Pygmy cluster, however, meant that the Eastern Bantu cluster no longer formed.
The full results from this new run can be downloaded from here, and the summary can be found below:
UPDATE4 (July 19 2012): Scenarios for different Eurasian proxies.
Here, I took the outlier-removed dataset from my last run (V2b), appended 3 different Eurasian proxies to it, and ran ADMIXTURE separately for each. In the first scenario I added the French from HGDP, in the second the Palestinians from HGDP, and in the third the Japanese from HapMap. In all three scenarios, the cluster distributions as well as the relative Fst distances of the African components change.
Scenario 1 (+French)
In this scenario, adding the French removes both the East Africa 1 and 2 clusters and adds a French cluster, while reintroducing the Eastern Bantu cluster. The relative Fst distances (seen below) shift the Omotic cluster closer to the French and North African clusters; in other words, the Omotic cluster behaves as a proxy for the East Africa 1 and 2 clusters that were present before the addition of the French.
Scenario 2 (+Palestinians)
In this scenario, a Palestinian cluster appears along with the Eastern Bantu cluster, the East Africa 1 cluster remains, and the East Africa 2 and North Africa clusters disappear. The first principal component of the Fst distances differentiates both the Hadza and the Palestinians, unlike with the addition of the French, where the first PC mostly differentiated the French and North African clusters. This is likely because Palestinians are closer to Africans than the French are.
Scenario 3 (+Japanese)
Here, when adding the Eurasian proxy that is geographically furthest from Africa, the East Africa 2 cluster disappears to make way for the Japanese cluster. All other clusters keep their previous peaking populations. As this approximates a global analysis, the first PC of the Fst distances separates solely the Japanese from all Africans, with the North African cluster occupying a slightly intermediate (albeit closer to Africans) position.
The mean frequencies for all 3 scenarios (plus the original V2b run) and the sampled populations can be found below.