
Wojciech Matejuk #11

Open - wants to merge 5 commits into master

Conversation


@WojciechMat WojciechMat commented Jul 30, 2023

I have decided to create my first pull request after finishing the first stage of the project for you to have more time to review it and for me to get the feedback as soon as possible.

Speed

1. Example of a chart created with speed.py (notes per minute):

   [chart: notes per minute]
2. Number of notes pressed simultaneously:
The number at the beginning of each chart title indicates the time threshold that was used, in seconds.

   [charts: simultaneous-note counts at several time thresholds]
The charts show the number of notes pressed simultaneously within different time thresholds. Interestingly, the number of simultaneous notes does not always increase with a larger time threshold, which suggests that a different grouping method is needed for better accuracy.
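A minimal sketch of this kind of greedy, threshold-based grouping (a hypothetical stand-alone helper, not the code from speed.py; onsets are note start times in seconds):

```python
def max_simultaneous(onsets, threshold):
    """Max number of onsets falling within `threshold` seconds of each
    other, using a greedy grouping anchored at each group's first onset."""
    best, group_start, count = 0, None, 0
    for t in sorted(onsets):
        # Start a new group when the current onset is too far
        # from the anchor of the current group.
        if group_start is None or t - group_start >= threshold:
            group_start, count = t, 0
        count += 1
        best = max(best, count)
    return best

print(max_simultaneous([0.00, 0.02, 0.04, 0.50], threshold=0.05))  # → 3
```

Because each group is anchored at its first onset, enlarging the threshold can shift group boundaries and split clusters differently, which is one plausible explanation for the non-monotonic counts seen in the charts.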

Using the functions I developed while creating this program, I processed the 'roszcz/maestro-v1' database and found that the piece containing the fastest 15 seconds of music is 'Etude Op. 25 No. 10 in B Minor' by Frédéric Chopin, with 483 notes in its fastest 15-second window. The algorithm took approximately 32 seconds to search the database and create a CSV file with the fastest 15-second window for each file.
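The fastest-15-seconds search can be sketched as a two-pointer sweep over sorted onset times (a hypothetical helper, not necessarily the implementation in speed.py):

```python
def densest_window(onsets, window=15.0):
    """Return (count, start_time) of the time window containing the most
    note onsets, via a two-pointer sweep over sorted onsets (seconds)."""
    onsets = sorted(onsets)
    best, best_start, lo = 0, 0.0, 0
    for hi, t in enumerate(onsets):
        # Advance the left pointer until the window fits.
        while t - onsets[lo] > window:
            lo += 1
        if hi - lo + 1 > best:
            best, best_start = hi - lo + 1, onsets[lo]
    return best, best_start

print(densest_window([0.0, 1.0, 2.0, 14.0, 30.0], window=15.0))  # → (4, 0.0)
```

The sweep is O(n log n) for the sort and O(n) for the scan, which is consistent with a full-database pass finishing in tens of seconds.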

Chords

Chords are recognized from groups of notes playing together.

1. Examples of charts created with chords.py:

   [chord charts]
2. I created a table of the most commonly repeated chord in each piece. A separate chart for each piece could be added in a future pull request.
full version: most-played-chords.csv

| id | composer | title | chord | repetitions |
|---|---|---|---|---|
| 0 | Alban Berg | Sonata Op. 1 | DM | 71 |
| 1 | Alban Berg | Sonata Op. 1 | F#M | 68 |
| 2 | Alban Berg | Sonata Op. 1 | F#M | 58 |
| ... | ... | ... | ... | ... |
| 99 | Johannes Brahms | Piano Concerto No. 1 in D Minor, Op. 15 | AM | 57 |
| 100 | Johannes Brahms | Piano Sonata No. 2 in F-Sharp Minor, Op. 2 | F#M | 72 |
| 101 | Johannes Brahms | Rhapsody in B Minor, Op. 79, No. 1 | F#M | 90 |
| ... | ... | ... | ... | ... |

The implementation could be optimized, as it performs several redundant conversions. Iterating through the entire database took approximately 25 minutes. The algorithm concluded that the piece with the highest number of repetitions of a single chord is:
Composer: Johannes Brahms
Title: Variations on a Theme by Paganini, Op. 35, Volumes 1 & 2
Chord: EM
Repetitions: 783

Optimizing Chord Recognition Algorithm:

After optimizing the chord recognition algorithm, the processing time has been significantly reduced to approximately 15 minutes from the previous 25 minutes. The piece with the most single chord repetitions is as follows:

  • Composer: Franz Schubert
  • Title: Sonata in D Major, D850
  • Chord: AM (A major)
  • Repetitions: 805

Finding the false-similars in a dataset

After finding inconsistent data in the dataset, I got curious about what I could do with it.
The approach involved using the Levenshtein distance as a distance measure, which has proven effective in finding similar music pieces in large MIDI databases (Guangyu Xia, Tongbo Huang, Yifei Ma, Roger Dannenberg, and Christos Faloutsos, "MidiFind: Fast and Effective Similarity Searching in Large MIDI Databases," School of Computer Science, Carnegie Mellon University).

1. Distance measure.

The Levenshtein distance was used as the distance measure due to its effectiveness and ease of implementation. It allows for the comparison of musical pieces represented as strings of notes with symbols to represent each note.
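For reference, the standard dynamic-programming form of the Levenshtein distance (not necessarily the exact implementation used here) is:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic edit distance: minimum number of single-character
    insertions, deletions, and substitutions turning `a` into `b`."""
    prev = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution (free on match)
            ))
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # → 3
```

The two-row formulation keeps memory at O(min-length) while the time cost stays O(len(a) * len(b)), which matters when comparing long note strings across the whole database.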

2. Feature extraction.

For the Levenshtein distance computation, each piece was converted into a string representation. Notes with the same pitch, but at different octaves, were treated as the same note (e.g., C4 and C5 were considered the same). Time was not considered in the computation as it would have reduced the algorithm's effectiveness and efficiency, as suggested in the earlier mentioned paper.
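The octave folding described above can be sketched by mapping each MIDI pitch to its pitch class (pitch mod 12) and then to a single symbol, so that e.g. C4 (MIDI 60) and C5 (MIDI 72) yield the same character (the symbol alphabet below is an arbitrary illustrative choice, not necessarily the one used in the code):

```python
# One symbol per pitch class 0..11; octave information is discarded.
PITCH_CLASS_SYMBOLS = "abcdefghijkl"

def pitches_to_string(pitches):
    """Convert a sequence of MIDI pitch numbers to an octave-invariant
    string suitable for Levenshtein comparison."""
    return "".join(PITCH_CLASS_SYMBOLS[p % 12] for p in pitches)

# C4 and C5 fold to the same symbol:
print(pitches_to_string([60, 72, 64, 67]))  # → "aaeh"
```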

3. Results.

The program saved the results as a CSV file containing indexes of recordings in the original database along with the composer and title of the compared pieces. Additionally, a CSV file was generated with all repeated pieces, including the "are similar" column, indexes, composer, and title.

CSV Files:

Here are the links to the CSV files with inconsistent data and repeated pieces:

  1. Inconsistent Data: inconsistent-data.csv
  2. Repeated Pieces: multiples.csv

Presentation Tables:

Below are tables presenting the data from the CSV files in a clearer format:

Table: Inconsistent Data

| row | index_1 | index_2 | title |
|---|---|---|---|
| 1 | 0 | 2 | Alban Berg, Sonata Op. 1 |
| 2 | 1 | 2 | Alban Berg, Sonata Op. 1 |
| 7 | 19 | 20 | Alexander Scriabin, Sonata No. 5, Op. 53 |
| ... | ... | ... | ... |
| 164 | 150 | 153 | Franz Liszt, Mephisto Waltz No. 1 |
| ... | ... | ... | ... |

Table: all repeated pieces

| row | are similar | index_1 | index_2 | title |
|---|---|---|---|---|
| 0 | True | 0 | 1 | Alban Berg, Sonata Op. 1 |
| 1 | False | 0 | 2 | Alban Berg, Sonata Op. 1 |
| 2 | False | 1 | 2 | Alban Berg, Sonata Op. 1 |
| ... | ... | ... | ... | ... |
| 118 | True | 100 | 105 | Felix Mendelssohn, Variations Serieuses, Op. 54 |
| ... | ... | ... | ... | ... |

The inconsistent pieces can be listened to using the play_rec_no function in process_data.py.

goodnight.


roszcz commented Jul 31, 2023

Thanks @WojciechMat, that's great!

I'm having some issues running your chords.py file:

  File "/home/hagrid/workson/midi-internship/src/chords.py", line 54, in MidiPiece_to_MidiFile
    mido_obj = MidiFile("res.mid", ticks_per_beat=480)
  File "/home/hagrid/.virtualenvs/midi-internship/lib/python3.9/site-packages/miditoolkit/midi/parser.py", line 66, in __init__
    self.instruments = self._load_instruments(mido_obj)
  File "/home/hagrid/.virtualenvs/midi-internship/lib/python3.9/site-packages/miditoolkit/midi/parser.py", line 205, in _load_instruments
    current_instrument = np.zeros(16, dtype=np.int)
  File "/home/hagrid/.virtualenvs/midi-internship/lib/python3.9/site-packages/numpy/__init__.py", line 313, in __getattr__
    raise AttributeError(__former_attrs__[attr])
AttributeError: module 'numpy' has no attribute 'int'.
`np.int` was a deprecated alias for the builtin `int`. To avoid this error in existing code, use `int` by itself. Doing this will not modify any behavior and is safe. When replacing `np.int`, you may wish to use e.g. `np.int64` or `np.int32` to specify the precision. If you wish to review your current use, check the release note link for additional information.
The aliases was originally deprecated in NumPy 1.20; for more details and guidance see the original release note at:
    https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations

It looks like it's related to the miditoolkit library not being up to date with numpy. Are you using a <1.20 version of numpy locally?

Another way around this problem would be not to do the MidiPiece -> PrettyMIDI -> miditoolkit conversion, and create the list of Note objects directly:

from miditoolkit.midi.containers import Note


def piece_to_notes(piece: ff.MidiPiece) -> list[Note]:
    notes = []
    columns = ["start", "end", "pitch", "velocity"]
    for it, row in piece.df[columns].iterrows():
        note = Note(**row.to_dict())
        notes.append(note)

    return notes

But I see that you're also doing some time conversion, so there would be more changes required.


WojciechMat commented Jul 31, 2023

@roszcz Thank you for the feedback!

Actually, I changed the miditoolkit/midi/parser.py file in my local miditoolkit install to use int instead of np.int, so it can be used with numpy > 1.20; you can do the same if you want to use the newest numpy version. Or you can wait for me to update the code to remove the unnecessary conversions.

I used the unnecessary conversions because I was getting to know the chorder library and its possibilities. I will switch to a better way of parsing notes soon.
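For anyone hitting the same error: instead of editing the installed parser.py, a common stopgap (a workaround, not a proper fix) is to restore the removed alias before importing miditoolkit:

```python
import numpy as np

# Newer NumPy releases removed the deprecated `np.int` alias that older
# miditoolkit versions still reference. Restoring it before the import
# avoids patching files inside site-packages.
if not hasattr(np, "int"):
    np.int = int  # compatibility shim for libraries still using np.int

# import miditoolkit  # safe to import once the shim is in place
```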


roszcz commented Jul 31, 2023

It's interesting to see the discrepancies in results for the same piece in the csv you shared:

| composer | title | chord | repetitions |
|---|---|---|---|
| Franz Schubert | Sonata in A Min. | CM | 143 |
| Franz Schubert | Sonata in A Min. | CM | 260 |
| Franz Schubert | Sonata in A Min. | EM | 135 |
| Franz Schubert | Sonata in A Min. | EM | 189 |

For the same piece, I think we should expect the chord counts to be very close.

The detected chord names are also interesting - I'm guessing there may be an issue with major and minor chord names: CMajor7 (C E G B) could easily be confused with EMinor (E G B), but not with EMajor (E G# B). From reading your code, it looks like this is the full chorder output, so I'm not sure where the confusion is coming from.

I'll try to take a more detailed look later today :)

@WojciechMat

@roszcz I have listened to those four pieces saved as Sonata in A Min. (positions 280, 281, 282 and 283 in database["train"+"test"+"validation"]), and they seem to be totally different pieces, so that is probably where the discrepancies come from.
I have also removed the unnecessary to-file and from-file conversions from the midipiece_to_midifile function, and it now takes only 15 minutes to iterate through the database. I will commit the changes in the evening and will try to determine whether the database is wrong or whether I am naming the pieces incorrectly.


roszcz commented Jul 31, 2023

This sort of makes sense - a sonata is a musical form usually consisting of three or four movements that are often performed separately, which could explain those differences. It's also great to know about this inconsistency in the maestro dataset, thanks!

@WojciechMat WojciechMat changed the title Wojciech Matejuk: easy part Wojciech Matejuk Jul 31, 2023

WojciechMat commented Aug 1, 2023

@roszcz I have updated the pull request description and am waiting for your feedback. I will use this stage in the task of finding similarities. Have a great day.

P.S. I plan to add visual representations of the files in the future.

src/chords.py Outdated
)


def process_dataset(dataset, csv_file_path):

Suggested change:
- def process_dataset(dataset, csv_file_path):
+ def process_dataset(dataset: Dataset) -> pd.DataFrame:

I think it's a good practice to separate data processing logic from any data presentation/visualization procedures, so the csv creation could be moved to the main part:

df = process_dataset(dataset)
df.to_csv(csv_file_path)

I also included type hints in my suggestion - these are very helpful in maintaining readability :)

src/chords.py Outdated
# Pop notes which ended before current note started
if pq.empty():
continue
while note.start > pq.queue[0].end:
@roszcz roszcz Aug 1, 2023

I'm not sure if I'm reading this correctly, but it looks like, as long as the first note of the sequence is being held, everything after it is going to be considered part of the chord.

For example, this gets recognized as CM:

   pitch  start  end  duration  velocity
0     60      0    3         3        80
1     64      1    3         2        80
2     67      2    3         1        80

But there is one second distance between each of the notes, so the expected output would be "no chords".

The original intention was to find chords defined as notes performed simultaneously - the issue here being that in reality those types of chords are spread over (short amount of) time - either due to aesthetic choices of the player, or just because it's almost impossible to press multiple keys at the exactly same time with millisecond precision :)

Your logic can be adjusted for those requirements with some kind of thresholding for the maximum allowed duration of the chord:

while pq.queue[0].end - note.start  < threshold:
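That requirement can be illustrated end to end with a small stand-alone sketch (hypothetical (start, pitch) tuples sorted by start time; it measures onset spread from the chord's first note rather than reusing the queue from the reviewed code):

```python
def find_chords(notes, threshold=0.05):
    """Group notes into chords when all their onsets fall within
    `threshold` seconds of the chord's first onset.
    `notes` is a list of (start, pitch) tuples sorted by start time."""
    chords, current = [], []
    for start, pitch in notes:
        if current and start - current[0][0] > threshold:
            # Keep only groups large enough to count as a chord.
            if len(current) >= 3:
                chords.append([p for _, p in current])
            current = []
        current.append((start, pitch))
    if len(current) >= 3:
        chords.append([p for _, p in current])
    return chords

# The C-E-G example from above, spread over 2 seconds: no chord detected.
print(find_chords([(0, 60), (1, 64), (2, 67)], threshold=0.05))  # → []
# The same pitches within 30 ms: one C major chord.
print(find_chords([(0.00, 60), (0.01, 64), (0.03, 67)], threshold=0.05))  # → [[60, 64, 67]]
```

With the onset-spread check in place, long sustained notes no longer absorb everything played after them into one chord.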

src/chords.py Outdated
Comment on lines 48 to 51
midi_data = piece.df
piece = piece.to_midi()
midi_data["start"] = midi_data.apply(start_to_ticks, axis=1, args=(piece,))
midi_data["end"] = midi_data.apply(end_to_ticks, axis=1, args=(piece,))

Suggested change:
- midi_data = piece.df
+ midi_data = piece.df.copy()
  piece = piece.to_midi()
  midi_data["start"] = midi_data.apply(start_to_ticks, axis=1, args=(piece,))
  midi_data["end"] = midi_data.apply(end_to_ticks, axis=1, args=(piece,))

Your version modifies the internal dataframe of the piece, which could lead to serious confusion if the same object had to be used for any other processing after chord detection. Making a copy is a safe way to prevent those problems.
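The aliasing problem the suggestion guards against can be demonstrated with a tiny stand-in object (Piece here is a hypothetical minimal class, not fortepyan's MidiPiece):

```python
import pandas as pd

class Piece:
    """Minimal stand-in for a MidiPiece-like object."""
    def __init__(self):
        self.df = pd.DataFrame({"start": [0.0, 1.0], "pitch": [60, 64]})

piece = Piece()
midi_data = piece.df            # same object, not a copy
midi_data["start"] = [10.0, 11.0]
print(piece.df["start"].tolist())   # → [10.0, 11.0]  internal state mutated!

piece = Piece()
midi_data = piece.df.copy()     # independent copy
midi_data["start"] = [10.0, 11.0]
print(piece.df["start"].tolist())   # → [0.0, 1.0]  original untouched
```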

Author:

Thank you very much!


WojciechMat commented Aug 1, 2023

I have incorporated your suggestions and added functionality.

Chords

Chord recognition

I decided to use 24 ticks as the threshold for maximum chord duration. The results of finding the most repeated chord in the database have changed and are now more consistent with the results in multiples.csv.

Most repeated chords

Here are the new results of finding the most repeated chord for each piece:

most-played-chords-v2.csv

The piece with the most repetitions is:

  • Composer: Franz Schubert
  • Title: Sonata in D Major, D850
  • Chord: DM
  • Repetitions: 1582

Chord tables

Here is an example of a chord table created for Edvard Grieg, Lyric Piece in E Minor, "Waltz", Op. 38 No. 7:



csv files:

franz-liszt-dante-sonata.csv

franz-schubert-sonata-in-d-major,-d850.csv

Finding inconsistent data - notes

I have updated the similarity threshold. The algorithm does produce some false negatives (pieces that are the same are sometimes flagged as not similar), but it can still help in renaming the data.

The Levenshtein-distance method does not recognize the same piece played in different keys as similar.

new csv files:
multiples.csv

inconsistent-data.csv
