shorten rownames #73

Open
LisaHollstein opened this issue Jan 17, 2022 · 5 comments

Comments

@LisaHollstein
Contributor

The row names in the output CSVs, as well as the bacteria names in the heatmaps, are very long,
e.g.

1_AE015929_1_Staphylococcus_epidermidis_ATCC_12228__BAC
1_CP007601_1_Staphylococcus_capitis_subsp__capitis_strain_AYP1020__BAC
1_AP011540_1_Rothia_mucilaginosa_DY_18_DNA__BAC

they could be shortened to something like

Staphylococcus epidermidis
Staphylococcus capitis subsp capitis
Rothia mucilaginosa

@LisaHollstein
Contributor Author

LisaHollstein commented Jan 17, 2022

@colindaven

I wrote some code that does the following:

  1. extract the species name (and subspecies) from the long name
  2. rename the rows in the table
  3. save as csv (same name as before, but with the extension "short")
  4. save a table that maps each initial name to its new short name
  5. if a species appears multiple times in the table (e.g. multiple chromosomes, with one row per chromosome), the values of those rows are summed (and a warning that this has been done is issued)
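
A minimal sketch of steps 1, 4 and 5 in Python; this is not the actual haybaler code. It assumes every long name follows the layout `<index>_<accession>_<version>_<Genus>_<species>_...` seen in the examples above, and `shorten_name` and `build_name_map` are hypothetical helper names chosen for illustration.

```python
import warnings
from collections import Counter


def shorten_name(long_name, sep=" "):
    """Return 'Genus species' (plus 'subsp <name>' if present) from a long row name."""
    # Assumed layout: <index>_<accession>_<version>_<Genus>_<species>_...
    # Double underscores produce empty tokens, which are dropped here.
    tokens = [t for t in long_name.split("_") if t]
    parts = tokens[3:5]
    if len(tokens) > 6 and tokens[5] == "subsp":
        parts += ["subsp", tokens[6]]
    return sep.join(parts)


def build_name_map(long_names, sep=" "):
    """Map each long name to its short name and warn if short names collide."""
    name_map = {name: shorten_name(name, sep) for name in long_names}
    counts = Counter(name_map.values())
    duplicates = sorted(short for short, n in counts.items() if n > 1)
    if duplicates:
        warnings.warn("short names are not unique: " + ", ".join(duplicates))
    return name_map, duplicates
```

The `sep` parameter is only there so the same helper can emit either spaces or another separator; names that do not match the assumed layout would need extra handling.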

@colindaven
Contributor

  • OK, good. Subspecies should hopefully not be present too often
  • Please use an underscore "_" between words; that makes the names easier to handle in different languages
  • The extra table with "_short.csv" as extension sounds good
  • Summing the values in the table is not appropriate in general; it only makes sense for non-normalized data such as read counts. Normalized data such as bacteria per human cell should be averaged (the median would be better, but there are mostly only two data points).
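
The sum-versus-average rule could be sketched roughly like this in Python. The function name `aggregate_rows` and the `normalized` flag are assumptions made for illustration, not part of haybaler:

```python
from statistics import mean


def aggregate_rows(rows, normalized):
    """Collapse several rows of one species into a single row, column by column.

    rows: list of equal-length lists of numbers (one list per original row)
    normalized: False for raw read counts (summed),
                True for normalized values (averaged)
    """
    combine = mean if normalized else sum
    return [combine(column) for column in zip(*rows)]
```

With only two data points per species the mean and the median are identical, so `statistics.median` could be swapped in for `mean` without changing such results.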

Thanks for this, I'll look forward to the implementation and PR

@LisaHollstein
Contributor Author

LisaHollstein commented Jan 19, 2022

Okay, the rows (usually) aren't summed anymore.

In haybaler.py, the "_short.csv" is only created if there aren't multiple rows for one species. Only for the read count table are the rows still summed and a "_short.csv" created.

In shorten_names.R, the rows just keep their old, long names.

@LisaHollstein
Contributor Author

I just noticed a problem:

The read_count_short table is in a different order than the normal read_count table. I think the easiest solution is to not output any csv with short names at all if the short names aren't unique.
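
A rough sketch of that idea in Python: rename rows in place so the order cannot change, and write nothing when the short names collide. `write_short_csv` and `name_map` are hypothetical names, and the sketch assumes a comma-separated file with a header line and the row name in the first column.

```python
import csv


def write_short_csv(in_path, out_path, name_map):
    """Write a copy of in_path with short row names, keeping the row order.

    name_map: dict from long row name to short row name (assumed input).
    Returns False and writes nothing if the short names would not be unique.
    """
    with open(in_path, newline="") as handle:
        header, *rows = list(csv.reader(handle))
    # Rows without an entry in name_map keep their original name.
    short_names = [name_map.get(row[0], row[0]) for row in rows]
    if len(set(short_names)) != len(short_names):
        return False
    with open(out_path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(header)
        for short, row in zip(short_names, rows):
            writer.writerow([short] + row[1:])
    return True
```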

@colindaven
Contributor

colindaven commented Jan 24, 2022

It would be nice to have a test for this problem too. Even just a line count of the two datasets, and if they're not the same output an error.

Short names look good to me so far.

species                               chr_length  gc_ref  Umwelt2_1_S20_R1  Umwelt2_2_S21_R1  Umwelt2_3_S93_R1  Umwelt2_4_S94_R1  Umwelt2_5_S95_R1
Moraxella_osloensis                   2434688.0   43.85   34502.64          153977.36         0.0               0.0               1293.59
Paracoccus_yeei                       3622127.0   67.18   8502.91           26258.12          0.0               0.0               0.0
Cutibacterium_acnes                   2522438.0   59.99   16523.65          13796.62          0.0               5061.46           11111.11



haybaler/control_dataset/haybaler_output$ wc -l *.csv
   160 bacteria_per_human_cell_haybaler.csv
   160 bacteria_per_human_cell_haybaler_short.csv
   154 excluded_taxa.csv
   160 read_count_haybaler.csv
   160 read_count_haybaler_short.csv
   160 reads_per_million_reads_in_experiment_haybaler.csv
   160 reads_per_million_reads_in_experiment_haybaler_short.csv
   160 reads_per_million_ref_bases_haybaler.csv
   160 reads_per_million_ref_bases_haybaler_short.csv
   160 RPMM_haybaler.csv
   160 RPMM_haybaler_short.csv
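
The `wc -l` comparison above could be automated along these lines. This is only a sketch in Python; `count_lines` and `check_same_line_count` are hypothetical helper names, and the paths are supplied by the caller.

```python
def count_lines(path):
    """Return the number of lines in a text file."""
    with open(path) as handle:
        return sum(1 for _ in handle)


def check_same_line_count(long_path, short_path):
    """Raise ValueError if the two files differ in line count."""
    n_long = count_lines(long_path)
    n_short = count_lines(short_path)
    if n_long != n_short:
        raise ValueError(
            f"{short_path} has {n_short} lines, but {long_path} has {n_long}"
        )
    return n_long
```

A line count only catches missing or extra rows; it would not detect the reordering problem mentioned earlier in the thread.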

