
Wojciech Matejuk #11

Open - wants to merge 5 commits into master

Conversation


@WojciechMat WojciechMat commented Jul 30, 2023

I have decided to create my first pull request after finishing the first stage of the project for you to have more time to review it and for me to get the feedback as soon as possible.

Speed

1. Example of a chart created with speed.py (notes per minute):

   [chart: notes per minute]
2. Number of notes pressed simultaneously:
The number at the beginning of each chart title indicates the time threshold that was used, in seconds.

   [charts: simultaneous-note counts at several time thresholds]
The charts show the number of notes pressed simultaneously within different time thresholds. Interestingly, the number of simultaneous notes does not always increase with a larger time threshold, which suggests that a different grouping method is needed for better accuracy.
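A minimal sketch of this kind of greedy, threshold-based grouping (a hypothetical stand-alone helper, not the code from speed.py; onsets are note start times in seconds):

```python
def max_simultaneous(onsets, threshold):
    """Max number of onsets falling within `threshold` seconds of each
    other, using a greedy grouping anchored at each group's first onset."""
    best, group_start, count = 0, None, 0
    for t in sorted(onsets):
        # Start a new group when the current onset is too far
        # from the anchor of the current group.
        if group_start is None or t - group_start >= threshold:
            group_start, count = t, 0
        count += 1
        best = max(best, count)
    return best

print(max_simultaneous([0.00, 0.02, 0.04, 0.50], threshold=0.05))  # → 3
```

Because each group is anchored at its first onset, enlarging the threshold can shift group boundaries and split clusters differently, which is one plausible explanation for the non-monotonic counts seen in the charts.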

Using the functions I developed while creating this program, I processed the 'roszcz/maestro-v1' database and found that the piece containing the fastest 15 seconds of music is 'Etude Op. 25 No. 10 in B Minor' by Frédéric Chopin, with 483 notes in its fastest 15-second window. The algorithm took approximately 32 seconds to search the database and create a CSV file with the fastest 15-second window for each file.
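The fastest-15-seconds search can be sketched as a two-pointer sweep over sorted onset times (a hypothetical helper, not necessarily the implementation in speed.py):

```python
def densest_window(onsets, window=15.0):
    """Return (count, start_time) of the time window containing the most
    note onsets, via a two-pointer sweep over sorted onsets (seconds)."""
    onsets = sorted(onsets)
    best, best_start, lo = 0, 0.0, 0
    for hi, t in enumerate(onsets):
        # Advance the left pointer until the window fits.
        while t - onsets[lo] > window:
            lo += 1
        if hi - lo + 1 > best:
            best, best_start = hi - lo + 1, onsets[lo]
    return best, best_start

print(densest_window([0.0, 1.0, 2.0, 14.0, 30.0], window=15.0))  # → (4, 0.0)
```

The sweep is O(n log n) for the sort and O(n) for the scan, which is consistent with a full-database pass finishing in tens of seconds.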

Chords

Chords are recognized from groups of notes playing together.

1. Examples of charts created with chords.py:

   [chord charts]
2. I created a table of the most commonly repeated chord in each piece. A separate chart for each piece could be added in a future pull request.
full version: most-played-chords.csv

| id | composer | title | chord | repetitions |
|---|---|---|---|---|
| 0 | Alban Berg | Sonata Op. 1 | DM | 71 |
| 1 | Alban Berg | Sonata Op. 1 | F#M | 68 |
| 2 | Alban Berg | Sonata Op. 1 | F#M | 58 |
| ... | ... | ... | ... | ... |
| 99 | Johannes Brahms | Piano Concerto No. 1 in D Minor, Op. 15 | AM | 57 |
| 100 | Johannes Brahms | Piano Sonata No. 2 in F-Sharp Minor, Op. 2 | F#M | 72 |
| 101 | Johannes Brahms | Rhapsody in B Minor, Op. 79, No. 1 | F#M | 90 |
| ... | ... | ... | ... | ... |

The implementation could be optimized, as it performs several redundant conversions. Iterating through the entire database took approximately 25 minutes. The algorithm concluded that the piece with the highest number of repetitions of a single chord is:
Composer: Johannes Brahms
Title: Variations on a Theme by Paganini, Op. 35, Volumes 1 & 2
Chord: EM
Repetitions: 783

Optimizing Chord Recognition Algorithm:

After optimizing the chord recognition algorithm, the processing time has been significantly reduced to approximately 15 minutes from the previous 25 minutes. The piece with the most single chord repetitions is as follows:

  • Composer: Franz Schubert
  • Title: Sonata in D Major, D850
  • Chord: AM (A major)
  • Repetitions: 805

Finding the false-similars in a dataset

After finding inconsistent data in the dataset, I got curious about what I could do with it.
The approach involved using the Levenshtein distance as a distance measure, which has proven effective in finding similar music pieces in large MIDI databases (Guangyu Xia, Tongbo Huang, Yifei Ma, Roger Dannenberg, and Christos Faloutsos, "MidiFind: Fast and Effective Similarity Searching in Large MIDI Databases," School of Computer Science, Carnegie Mellon University).

1. Distance measure.

The Levenshtein distance was used as the distance measure due to its effectiveness and ease of implementation. It allows for the comparison of musical pieces represented as strings of notes with symbols to represent each note.
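For reference, the standard dynamic-programming form of the Levenshtein distance (not necessarily the exact implementation used here) is:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic edit distance: minimum number of single-character
    insertions, deletions, and substitutions turning `a` into `b`."""
    prev = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution (free on match)
            ))
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # → 3
```

The two-row formulation keeps memory at O(min-length) while the time cost stays O(len(a) * len(b)), which matters when comparing long note strings across the whole database.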

2. Feature extraction.

For the Levenshtein distance computation, each piece was converted into a string representation. Notes with the same pitch, but at different octaves, were treated as the same note (e.g., C4 and C5 were considered the same). Time was not considered in the computation as it would have reduced the algorithm's effectiveness and efficiency, as suggested in the earlier mentioned paper.
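The octave folding described above can be sketched by mapping each MIDI pitch to its pitch class (pitch mod 12) and then to a single symbol, so that e.g. C4 (MIDI 60) and C5 (MIDI 72) yield the same character (the symbol alphabet below is an arbitrary illustrative choice, not necessarily the one used in the code):

```python
# One symbol per pitch class 0..11; octave information is discarded.
PITCH_CLASS_SYMBOLS = "abcdefghijkl"

def pitches_to_string(pitches):
    """Convert a sequence of MIDI pitch numbers to an octave-invariant
    string suitable for Levenshtein comparison."""
    return "".join(PITCH_CLASS_SYMBOLS[p % 12] for p in pitches)

# C4 and C5 fold to the same symbol:
print(pitches_to_string([60, 72, 64, 67]))  # → "aaeh"
```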

3. Results.

The program saved the results as a CSV file containing indexes of recordings in the original database along with the composer and title of the compared pieces. Additionally, a CSV file was generated with all repeated pieces, including the "are similar" column, indexes, composer, and title.

CSV Files:

Here are the links to the CSV files with inconsistent data and repeated pieces:

  1. Inconsistent Data: inconsistent-data.csv
  2. Repeated Pieces: multiples.csv

Presentation Tables:

Below are tables presenting the data from the CSV files in a clearer format:

Table: Inconsistent Data

| row | index_1 | index_2 | title |
|---|---|---|---|
| 1 | 0 | 2 | Alban Berg, Sonata Op. 1 |
| 2 | 1 | 2 | Alban Berg, Sonata Op. 1 |
| 7 | 19 | 20 | Alexander Scriabin, Sonata No. 5, Op. 53 |
| ... | ... | ... | ... |
| 164 | 150 | 153 | Franz Liszt, Mephisto Waltz No. 1 |
| ... | ... | ... | ... |

Table: all repeated pieces

| row | are similar | index_1 | index_2 | title |
|---|---|---|---|---|
| 0 | True | 0 | 1 | Alban Berg, Sonata Op. 1 |
| 1 | False | 0 | 2 | Alban Berg, Sonata Op. 1 |
| 2 | False | 1 | 2 | Alban Berg, Sonata Op. 1 |
| ... | ... | ... | ... | ... |
| 118 | True | 100 | 105 | Felix Mendelssohn, Variations Serieuses, Op. 54 |
| ... | ... | ... | ... | ... |

The inconsistent pieces can be listened to using the play_rec_no function in process_data.py.

goodnight.


roszcz commented Jul 31, 2023

Thanks @WojciechMat, that's great!

I'm having some issues running your chords.py file:

  File "/home/hagrid/workson/midi-internship/src/chords.py", line 54, in MidiPiece_to_MidiFile
    mido_obj = MidiFile("res.mid", ticks_per_beat=480)
  File "/home/hagrid/.virtualenvs/midi-internship/lib/python3.9/site-packages/miditoolkit/midi/parser.py", line 66, in __init__
    self.instruments = self._load_instruments(mido_obj)
  File "/home/hagrid/.virtualenvs/midi-internship/lib/python3.9/site-packages/miditoolkit/midi/parser.py", line 205, in _load_instruments
    current_instrument = np.zeros(16, dtype=np.int)
  File "/home/hagrid/.virtualenvs/midi-internship/lib/python3.9/site-packages/numpy/__init__.py", line 313, in __getattr__
    raise AttributeError(__former_attrs__[attr])
AttributeError: module 'numpy' has no attribute 'int'.
`np.int` was a deprecated alias for the builtin `int`. To avoid this error in existing code, use `int` by itself. Doing this will not modify any behavior and is safe. When replacing `np.int`, you may wish to use e.g. `np.int64` or `np.int32` to specify the precision. If you wish to review your current use, check the release note link for additional information.
The aliases was originally deprecated in NumPy 1.20; for more details and guidance see the original release note at:
    https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations

It looks like it's related to the miditoolkit library not being up to date with numpy. Are you using a <1.20 version of numpy locally?

Another way around this problem would be not to do the MidiPiece -> PrettyMIDI -> miditoolkit conversion, and create the list of Note objects directly:

from miditoolkit.midi.containers import Note


def piece_to_notes(piece: ff.MidiPiece) -> list[Note]:
    notes = []
    columns = ["start", "end", "pitch", "velocity"]
    for it, row in piece.df[columns].iterrows():
        note = Note(**row.to_dict())
        notes.append(note)

    return notes

But I see that you're also doing some time conversion, so there would be more changes required.


WojciechMat commented Jul 31, 2023

@roszcz Thank you for the feedback!

Actually, I changed the miditoolkit/midi/parser.py file in my local miditoolkit install to use int instead of np.int, so it can be used with numpy > 1.20; you can do the same if you want to use the newest numpy version. Or you can wait for me to update the code to remove the unnecessary conversions.

I used the unnecessary conversions because I was getting to know the chorder library and its possibilities. I will switch to a better way of parsing notes soon.
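For anyone hitting the same error: instead of editing the installed parser.py, a common stopgap (a workaround, not a proper fix) is to restore the removed alias before importing miditoolkit:

```python
import numpy as np

# Newer NumPy releases removed the deprecated `np.int` alias that older
# miditoolkit versions still reference. Restoring it before the import
# avoids patching files inside site-packages.
if not hasattr(np, "int"):
    np.int = int  # compatibility shim for libraries still using np.int

# import miditoolkit  # safe to import once the shim is in place
```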


roszcz commented Jul 31, 2023

It's interesting to see the discrepancies in results for the same piece in the csv you shared:

| composer | title | chord | repetitions |
|---|---|---|---|
| Franz Schubert | Sonata in A Min. | CM | 143 |
| Franz Schubert | Sonata in A Min. | CM | 260 |
| Franz Schubert | Sonata in A Min. | EM | 135 |
| Franz Schubert | Sonata in A Min. | EM | 189 |

For the same piece, I think we should expect the chord counts to be very close.

The detected chord names are also interesting - I'm guessing there may be an issue with major and minor chord names: CMajor7 (C E G B) could easily be confused with EMinor (E G B), but not with EMajor (E G# B). From reading your code, it looks like this is the full chorder output, so I'm not sure where the confusion is coming from.

I'll try to take a more detailed look later today :)

@WojciechMat

@roszcz I have listened to those four pieces saved as Sonata in A Min. (positions 280, 281, 282 and 283 in database["train"+"test"+"validation"]), and they seem to be totally different pieces, so that is probably where the discrepancies come from.
I have also removed the unnecessary to-file and from-file conversions from the midipiece_to_midifile function, and it now takes only 15 minutes to iterate through the database. I will commit the changes in the evening and will try to determine whether the database is wrong or whether I am naming the pieces incorrectly.


roszcz commented Jul 31, 2023

This sort of makes sense - a sonata is a musical form usually consisting of three or four movements that are often performed separately, which could explain those differences. It's also great to know about this inconsistency in the maestro dataset, thanks!

@WojciechMat WojciechMat changed the title Wojciech Matejuk: easy part Wojciech Matejuk Jul 31, 2023

WojciechMat commented Aug 1, 2023

@roszcz I have updated the pull request description and am waiting for your feedback. I will use this stage in the task of finding similarities. Have a great day.

P.S. I plan to add visual representations of the files in the future.

src/chords.py Outdated
)


def process_dataset(dataset, csv_file_path):

Suggested change:
- def process_dataset(dataset, csv_file_path):
+ def process_dataset(dataset: Dataset) -> pd.DataFrame:

I think it's a good practice to separate data processing logic from any data presentation/visualization procedures, so the csv creation could be moved to the main part:

df = process_dataset(dataset)
df.to_csv(csv_file_path)

I also included type hints in my suggestion - these are very helpful in maintaining readability :)

src/chords.py Outdated
# Pop notes which ended before current note started
if pq.empty():
continue
while note.start > pq.queue[0].end:
@roszcz roszcz Aug 1, 2023

I'm not sure if I'm reading this correctly, but it looks like, as long as the first note of the sequence is being held, everything after it is going to be considered part of the chord.

For example, this gets recognized as CM:

   pitch  start  end  duration  velocity
0     60      0    3         3        80
1     64      1    3         2        80
2     67      2    3         1        80

But there is one second distance between each of the notes, so the expected output would be "no chords".

The original intention was to find chords defined as notes performed simultaneously - the issue here being that in reality those types of chords are spread over (short amount of) time - either due to aesthetic choices of the player, or just because it's almost impossible to press multiple keys at the exactly same time with millisecond precision :)

Your logic can be adjusted for those requirements with some kind of thresholding for the maximum allowed duration of the chord:

while pq.queue[0].end - note.start  < threshold:
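That requirement can be illustrated end to end with a small stand-alone sketch (hypothetical (start, pitch) tuples sorted by start time; it measures onset spread from the chord's first note rather than reusing the queue from the reviewed code):

```python
def find_chords(notes, threshold=0.05):
    """Group notes into chords when all their onsets fall within
    `threshold` seconds of the chord's first onset.
    `notes` is a list of (start, pitch) tuples sorted by start time."""
    chords, current = [], []
    for start, pitch in notes:
        if current and start - current[0][0] > threshold:
            # Keep only groups large enough to count as a chord.
            if len(current) >= 3:
                chords.append([p for _, p in current])
            current = []
        current.append((start, pitch))
    if len(current) >= 3:
        chords.append([p for _, p in current])
    return chords

# The C-E-G example from above, spread over 2 seconds: no chord detected.
print(find_chords([(0, 60), (1, 64), (2, 67)], threshold=0.05))  # → []
# The same pitches within 30 ms: one C major chord.
print(find_chords([(0.00, 60), (0.01, 64), (0.03, 67)], threshold=0.05))  # → [[60, 64, 67]]
```

With the onset-spread check in place, long sustained notes no longer absorb everything played after them into one chord.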

src/chords.py Outdated
Comment on lines 48 to 51
midi_data = piece.df
piece = piece.to_midi()
midi_data["start"] = midi_data.apply(start_to_ticks, axis=1, args=(piece,))
midi_data["end"] = midi_data.apply(end_to_ticks, axis=1, args=(piece,))

Suggested change:
- midi_data = piece.df
+ midi_data = piece.df.copy()
  piece = piece.to_midi()
  midi_data["start"] = midi_data.apply(start_to_ticks, axis=1, args=(piece,))
  midi_data["end"] = midi_data.apply(end_to_ticks, axis=1, args=(piece,))

Your version modifies the internal dataframe of the piece, which could lead to serious confusion if the same object had to be used for any other processing after chord detection. Making a copy is a safe way to prevent those problems.
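The aliasing problem the suggestion guards against can be demonstrated with a tiny stand-in object (Piece here is a hypothetical minimal class, not fortepyan's MidiPiece):

```python
import pandas as pd

class Piece:
    """Minimal stand-in for a MidiPiece-like object."""
    def __init__(self):
        self.df = pd.DataFrame({"start": [0.0, 1.0], "pitch": [60, 64]})

piece = Piece()
midi_data = piece.df            # same object, not a copy
midi_data["start"] = [10.0, 11.0]
print(piece.df["start"].tolist())   # → [10.0, 11.0]  internal state mutated!

piece = Piece()
midi_data = piece.df.copy()     # independent copy
midi_data["start"] = [10.0, 11.0]
print(piece.df["start"].tolist())   # → [0.0, 1.0]  original untouched
```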

Author:

Thank you very much!


WojciechMat commented Aug 1, 2023

I have incorporated your suggestions and added functionality.

Chords

Chord recognition

I decided to use 24 ticks as the threshold for maximum chord duration. The results of finding the most repeated chord in the database have changed and are now more consistent with the results in multiples.csv.

Most repeated chords

Here are the new results of finding the most repeated chord for each piece:

most-played-chords-v2.csv

The piece with the most repetitions is:

  • Composer: Franz Schubert
  • Title: Sonata in D Major, D850
  • Chord: DM
  • Repetitions: 1582

Chord tables

Here is an example of a chord table created for Edvard Grieg, Lyric Piece in E Minor, "Waltz", Op. 38 No. 7:



csv files:

franz-liszt-dante-sonata.csv

franz-schubert-sonata-in-d-major,-d850.csv

Finding inconsistent data - notes

I have updated the similarity threshold. The algorithm does produce some false negatives (pieces that are the same are sometimes flagged as not similar), but it can still help in renaming the data.

The Levenshtein-distance method does not recognize the same piece played in different keys as similar.

new csv files:
multiples.csv

inconsistent-data.csv
