shorten rownames #73

LisaHollstein · 2022-01-17T13:08:26Z

The row names in the output CSVs, as well as the bacteria names in the heatmaps, are very long,
e.g.

1_AE015929_1_Staphylococcus_epidermidis_ATCC_12228__BAC
1_CP007601_1_Staphylococcus_capitis_subsp__capitis_strain_AYP1020__BAC
1_AP011540_1_Rothia_mucilaginosa_DY_18_DNA__BAC

They could be shortened to something like

Staphylococcus epidermidis
Staphylococcus capitis subsp capitis
Rothia mucilaginosa

LisaHollstein · 2022-01-17T13:19:11Z

@colindaven

I wrote some code that does the following:

extract the species name (and subspecies, if present) from the long name
rename the rows in the table
save as CSV (same name as before, but with the suffix "short")
save a table that maps each initial name to its new short name
if a species occurs multiple times in the table (e.g. multiple chromosomes, with one row per chromosome), the values of the rows are summed (and a warning is issued that this has been done)
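
The extraction step could look roughly like the sketch below. It is not the code from the PR: `shorten_name` is a hypothetical helper, and it assumes that every long name starts with three underscore-separated tokens (index, accession, version) before the genus, as in the three examples above. Accessions that themselves contain an underscore would need a different rule.

```python
def shorten_name(long_name, sep=" "):
    """Hypothetical helper: return genus, species and an optional subspecies."""
    # Drop the empty tokens produced by double underscores ("subsp__capitis").
    tokens = [t for t in long_name.split("_") if t]
    # Assumption: the first three tokens are index, accession and version.
    name_tokens = tokens[3:]
    short = name_tokens[:2]  # genus and species
    # Keep "subsp <name>" only when it directly follows the species.
    if len(name_tokens) >= 4 and name_tokens[2] == "subsp":
        short += name_tokens[2:4]
    return sep.join(short)


long_names = [
    "1_AE015929_1_Staphylococcus_epidermidis_ATCC_12228__BAC",
    "1_CP007601_1_Staphylococcus_capitis_subsp__capitis_strain_AYP1020__BAC",
    "1_AP011540_1_Rothia_mucilaginosa_DY_18_DNA__BAC",
]
# Mapping of initial name to short name, i.e. the content of the extra table.
name_map = {name: shorten_name(name) for name in long_names}
```

Passing `sep="_"` joins the words with underscores instead of spaces.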

colindaven · 2022-01-17T15:22:50Z

OK, good. Subspecies should hopefully not be present too often
Please use an underscore "_" between words; that makes it easier to code for in different languages.
The extra table with "_short.csv" as extension sounds good.
Summing the values in the table is not appropriate, or only for non-normalized data like read counts. Normalized data like bacteria per human cell should be averaged (the median would be better, but we mostly have only two data points).
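
To make the distinction concrete, here is a minimal sketch of collapsing rows that share a short name, with the aggregation chosen per table type. The function name `collapse_duplicates` and the `(name, values)` row layout are assumptions for illustration, not the haybaler implementation.

```python
from collections import defaultdict
from statistics import mean, median

# "sum" suits raw read counts; "mean" or "median" suit normalized values.
AGGREGATORS = {"sum": sum, "mean": mean, "median": median}


def collapse_duplicates(rows, how):
    """Hypothetical helper: merge rows with the same short name column by column."""
    grouped = defaultdict(list)
    for name, values in rows:
        grouped[name].append(values)
    aggregate = AGGREGATORS[how]
    # zip(*value_rows) walks the sample columns of all rows for one species.
    return {
        name: [aggregate(column) for column in zip(*value_rows)]
        for name, value_rows in grouped.items()
    }


# Made-up example values: one species split over two chromosomes.
rows = [
    ("Genus_species_a", [10, 0]),
    ("Genus_species_a", [30, 4]),
    ("Genus_species_b", [1, 2]),
]
read_counts = collapse_duplicates(rows, "sum")
normalized = collapse_duplicates(rows, "mean")
```

With only two rows per species, mean and median give the same value, so the choice only matters once there are more data points.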

Thanks for this, I'll look forward to the implementation and PR

LisaHollstein · 2022-01-19T13:15:33Z

Okay, the rows (usually) aren't summed anymore.

In haybaler.py, the "_short.csv" is only created if there aren't multiple rows for one species. Only for the read count table are the rows still summed and a "_short.csv" created.

In shorten_names.R, the rows just keep their old, long names.

LisaHollstein · 2022-01-20T09:20:14Z

I just noticed a problem:

The read_count_short table is in a different order than the normal read_count table. I think the easiest solution is to not output any CSV with short names at all if the short names aren't unique.
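
A uniqueness check along these lines might be enough to decide whether the short CSVs get written. This is only a sketch: both function names are hypothetical and the example names are made up.

```python
from collections import Counter


def find_duplicate_short_names(short_names):
    """Hypothetical helper: return the short names that occur more than once."""
    counts = Counter(short_names)
    return sorted(name for name, n in counts.items() if n > 1)


def should_write_short_csv(short_names):
    """Write the "_short.csv" files only if every short name is unique."""
    return not find_duplicate_short_names(short_names)


names = ["Genus_species_a", "Genus_species_b", "Genus_species_a"]
duplicates = find_duplicate_short_names(names)
```

Returning the duplicated names, rather than only a boolean, would make it possible to list them in a warning.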

colindaven · 2022-01-24T14:43:13Z

It would be nice to have a test for this problem too. Even just a line count of the two datasets would do, with an error output if they're not the same.
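
Such a line-count test could be sketched as follows. The helper names are hypothetical, and the check only compares the number of lines, so it would not catch a reordering of rows.

```python
def count_lines(path):
    """Count the lines of a text file without loading it into memory."""
    with open(path, encoding="utf-8") as handle:
        return sum(1 for _ in handle)


def check_same_line_count(long_csv, short_csv):
    """Hypothetical check: raise an error if the two tables differ in length."""
    n_long = count_lines(long_csv)
    n_short = count_lines(short_csv)
    if n_long != n_short:
        raise ValueError(
            f"{short_csv} has {n_short} lines, but {long_csv} has {n_long}"
        )
    return n_long
```

It could be called with a pair such as read_count_haybaler.csv and read_count_haybaler_short.csv.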

Short names look good to me so far.

species                               chr_length  gc_ref  Umwelt2_1_S20_R1  Umwelt2_2_S21_R1  Umwelt2_3_S93_R1  Umwelt2_4_S94_R1  Umwelt2_5_S95_R1
Moraxella_osloensis                   2434688.0   43.85   34502.64          153977.36         0.0               0.0               1293.59
Paracoccus_yeei                       3622127.0   67.18   8502.91           26258.12          0.0               0.0               0.0
Cutibacterium_acnes                   2522438.0   59.99   16523.65          13796.62          0.0               5061.46           11111.11



haybaler/control_dataset/haybaler_output$ wc -l *.csv
   160 bacteria_per_human_cell_haybaler.csv
   160 bacteria_per_human_cell_haybaler_short.csv
   154 excluded_taxa.csv
   160 read_count_haybaler.csv
   160 read_count_haybaler_short.csv
   160 reads_per_million_reads_in_experiment_haybaler.csv
   160 reads_per_million_reads_in_experiment_haybaler_short.csv
   160 reads_per_million_ref_bases_haybaler.csv
   160 reads_per_million_ref_bases_haybaler_short.csv
   160 RPMM_haybaler.csv
   160 RPMM_haybaler_short.csv
