This repository has been archived by the owner on Oct 17, 2024. It is now read-only.

Possible bug in "chrom" field #3

Open
EricR86 opened this issue Jun 7, 2024 · 1 comment

Comments

Collaborator

EricR86 commented Jun 7, 2024

Hello,

Recently there was some work with BED files and RefSeq/GenBank chromosome IDs, which typically contain a period for versioning purposes (e.g. "NC_000001.11"). The spec currently does not allow such IDs as-is: only alphanumeric characters are permitted in the chrom field.

I e-mailed Jim Kent regarding this issue and this is what he had to say:
"Yes, I would consider this an error. All of our parsers are good with anything but white space there. Most of our utilities will handle spaces if you throw in a -tab option, but I wouldn't want to encourage that."
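To make the difference concrete, here is a minimal sketch comparing a strict check against the "anything but white space" behaviour described in the reply. Both patterns and both function names are assumptions made for illustration: the strict pattern (letters, digits, underscore, up to 255 characters) is one reading of the current restriction, and the permissive pattern is only an approximation of what the UCSC parsers accept.

```python
import re

# Assumed strict rule: ASCII letters, digits, and underscore, 1 to 255 characters.
# This is an illustrative reading of the current restriction, not quoted from the spec.
STRICT_CHROM = re.compile(r"[A-Za-z0-9_]{1,255}")

# Assumed permissive rule: one or more non-whitespace characters,
# approximating "anything but white space" from the reply above.
PERMISSIVE_CHROM = re.compile(r"\S+")


def is_strict_chrom(name: str) -> bool:
    """Return True if the whole name matches the assumed strict rule."""
    return STRICT_CHROM.fullmatch(name) is not None


def is_permissive_chrom(name: str) -> bool:
    """Return True if the name is non-empty and contains no whitespace."""
    return PERMISSIVE_CHROM.fullmatch(name) is not None


if __name__ == "__main__":
    for name in ["chr1", "NC_000001.11", "chr 1"]:
        print(name, is_strict_chrom(name), is_permissive_chrom(name))
```

Under these assumptions, "NC_000001.11" fails the strict check only because of the period, while "chr 1" fails both checks because of the space.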

Collaborator Author

EricR86 commented Jul 2, 2024

There was another response from UCSC. Matthew Speir had this to say:

In short, we think periods should be allowed in an update to the BED specification...
bigBed, bigWig, and other big* formats similarly don't have restrictions on using periods in the chrom field.

The details and the original reasoning come from Angie Hinrichs, an engineer there:

When we exclusively used MySQL for storage (before bigBed, etc), we split some of our largest tracks into a table per chromosome. For example, instead of a single table "xenoMrna" there would be separate tables chr1_xenoMrna, chr2_xenoMrna and so on. This meant only characters that could be used in MySQL table names without special quoting could be used for the chrom field, because they might end up as prefixes in mysql table names. As I'm sure you know, '.' has special meaning in SQL as a separator between database, table, and field.

However, we had to stop using "split tables" when we added new organisms whose assemblies consisted of tens of thousands or even hundreds of thousands of scaffold sequences -- that would just be way too many MySQL tables. That restriction still applied to old databases with split tables, but not to new databases after a certain point.
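The split-table constraint described above can be sketched as follows. The helper names are hypothetical, and the character class covers only the ASCII characters MySQL accepts in unquoted identifiers (digits, letters, `$`, `_`). It is an illustration of the reasoning, not UCSC's actual code.

```python
import re

# ASCII characters MySQL permits in an unquoted identifier: 0-9, a-z, A-Z, $, _
# (MySQL also allows some extended characters; they are left out of this sketch.)
UNQUOTED_IDENT = re.compile(r"[0-9A-Za-z$_]+")


def split_table_name(chrom: str, track: str) -> str:
    """Build a per-chromosome table name such as 'chr1_xenoMrna' (hypothetical helper)."""
    return f"{chrom}_{track}"


def needs_quoting(table: str) -> bool:
    """Return True if the name could not be used as an unquoted MySQL identifier."""
    return UNQUOTED_IDENT.fullmatch(table) is None


if __name__ == "__main__":
    for chrom in ["chr1", "NC_000001.11"]:
        table = split_table_name(chrom, "xenoMrna")
        print(table, "needs quoting" if needs_quoting(table) else "ok unquoted")
        # An unquoted '.' is read as a qualifier separator, so the name splits apart:
        print("  parts if unquoted:", table.split("."))
```

With a chrom of "NC_000001.11", the unquoted name would be read as `NC_000001` followed by `11_xenoMrna`, i.e. a database and a table, which is the kind of ambiguity the original restriction avoided.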
