Sensed related charts not being generated for open-access-openpath #132

Open
iantei opened this issue Apr 19, 2024 · 11 comments


iantei commented Apr 19, 2024

Currently, there is an issue with the chart generation for open-access-openpath:
https://open-access-openpath.nrel.gov/public/

The primary reason is the unavailability of the column cleaned_section_summary in the expanded_ct dataframe.

Error call stack:

AttributeError                            Traceback (most recent call last)
Cell In[3], line 1
----> 1 expanded_ct, file_suffix, quality_text, debug_df = scaffolding.load_viz_notebook_sensor_inference_data(year,
      2                                                                             month,
      3                                                                             program,
      4                                                                             include_test_users,
      5                                                                             sensed_algo_prefix)

File /usr/src/app/saved-notebooks/scaffolding.py:229, in load_viz_notebook_sensor_inference_data(year, month, program, include_test_users, sensed_algo_prefix)
    227 print(f"Expanded_ct columns: \n {expanded_ct.columns}")
    228 if len(expanded_ct) > 0:
--> 229     expanded_ct["primary_mode_non_other"] = participant_ct_df.cleaned_section_summary.apply(lambda md: max(md["distance"], key=md["distance"].get))
    230     expanded_ct.primary_mode_non_other.replace({"ON_FOOT": "WALKING"}, inplace=True)
    231     valid_sensed_modes = ["WALKING", "BICYCLING", "IN_VEHICLE", "AIR_OR_HSR", "UNKNOWN"]

File ~/miniconda-23.5.2/envs/emission/lib/python3.9/site-packages/pandas/core/generic.py:5902, in NDFrame.__getattr__(self, name)
   5895 if (
   5896     name not in self._internal_names_set
   5897     and name not in self._metadata
   5898     and name not in self._accessors
   5899     and self._info_axis._can_hold_identifiers_and_holds_name(name)
   5900 ):
   5901     return self[name]
-> 5902 return object.__getattribute__(self, name)

AttributeError: 'DataFrame' object has no attribute 'cleaned_section_summary'
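This error can be reproduced in isolation: pandas raises AttributeError when a column accessed with attribute syntax does not exist. A minimal sketch with a toy dataframe (the data values are illustrative, not from the dataset):

```python
import pandas as pd

# Toy stand-in for expanded_ct, deliberately missing the
# cleaned_section_summary column.
df = pd.DataFrame({"distance": [100.0, 250.0]})

try:
    # Attribute-style access fails when the column is absent;
    # df["cleaned_section_summary"] would raise KeyError instead.
    df.cleaned_section_summary
except AttributeError as e:
    print(e)  # 'DataFrame' object has no attribute 'cleaned_section_summary'
```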

For the dataset fc_*, which has issues creating the sensed related charts, the expanded_ct columns are:

 Expanded_ct columns: 
 Index(['source', 'end_ts', 'end_fmt_time', 'end_loc', 'raw_trip', 'start_ts',
       'start_fmt_time', 'start_loc', 'duration', 'distance', 'start_place',
       'end_place', 'cleaned_trip', 'inferred_labels', 'inferred_trip',
       'expectation', 'confidence_threshold', 'expected_trip', 'user_input',
       'start_local_dt_year', 'start_local_dt_month', 'start_local_dt_day',
       'start_local_dt_hour', 'start_local_dt_minute', 'start_local_dt_second',
       'start_local_dt_weekday', 'start_local_dt_timezone',
       'end_local_dt_year', 'end_local_dt_month', 'end_local_dt_day',
       'end_local_dt_hour', 'end_local_dt_minute', 'end_local_dt_second',
       'end_local_dt_weekday', 'end_local_dt_timezone', '_id', 'user_id',
       'metadata_write_ts'],
      dtype='object')

For the dataset openpath_prod_cortezebikes, which does not have issues creating the sensed related charts, the expanded_ct columns are:


Expanded_ct columns: 
Index(['source', 'end_ts', 'end_fmt_time', 'end_loc', 'raw_trip', 'start_ts',
      'start_fmt_time', 'start_loc', 'duration', 'distance', 'start_place',
      'end_place', 'cleaned_trip', 'inferred_labels', 'inferred_trip',
      'expectation', 'confidence_threshold', 'expected_trip', 'user_input',
      'additions', 'inferred_section_summary', 'cleaned_section_summary',
      'start_local_dt_year', 'start_local_dt_month', 'start_local_dt_day',
      'start_local_dt_hour', 'start_local_dt_minute', 'start_local_dt_second',
      'start_local_dt_weekday', 'start_local_dt_timezone',
      'end_local_dt_year', 'end_local_dt_month', 'end_local_dt_day',
      'end_local_dt_hour', 'end_local_dt_minute', 'end_local_dt_second',
      'end_local_dt_weekday', 'end_local_dt_timezone', '_id', 'user_id',
      'metadata_write_ts'],
     dtype='object')

The columns present in expanded_ct for the second dataset but missing for the first are listed below:

-  'additions', 
- 'inferred_section_summary', 
- 'cleaned_section_summary' 
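The column difference between the two datasets can be computed directly; a sketch with toy frames that reproduce only a handful of the columns from the listings above:

```python
import pandas as pd

# Toy frames standing in for the two expanded_ct dataframes; only a
# few of the columns from the listings above are reproduced here.
fc_ct = pd.DataFrame(columns=["source", "distance", "user_input"])
cortez_ct = pd.DataFrame(columns=["source", "distance", "user_input",
                                  "additions", "inferred_section_summary",
                                  "cleaned_section_summary"])

missing = sorted(set(cortez_ct.columns) - set(fc_ct.columns))
print(missing)
# ['additions', 'cleaned_section_summary', 'inferred_section_summary']
```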
@iantei iantei moved this to Issues being worked on in OpenPATH Tasks Overview Apr 19, 2024

iantei commented Apr 20, 2024

Upon further investigation:

def load_all_confirmed_trips(tq):
    agg = esta.TimeSeries.get_aggregate_time_series()
    all_ct = agg.get_data_df("analysis/confirmed_trip", tq)
    print("Loaded all confirmed trips of length %s" % len(all_ct))
    print(f"Columns of all_ct: {all_ct.columns} \n")
    disp.display(all_ct.head())
    return all_ct

The all_ct data frame doesn't have the additions, inferred_section_summary and cleaned_section_summary columns.

We need to understand further why these columns, which come from analysis/confirmed_trip, are missing.
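A defensive check could flag this right after loading; a sketch (the helper name missing_required_columns is hypothetical, the column names are the ones from this thread):

```python
import pandas as pd

# Columns that analysis/confirmed_trip entries are expected to carry.
REQUIRED_COLS = ["additions", "inferred_section_summary",
                 "cleaned_section_summary"]

def missing_required_columns(df):
    """Return the expected columns that are absent from df."""
    return [c for c in REQUIRED_COLS if c not in df.columns]

# Toy stand-in for all_ct, which lacks all three columns.
all_ct = pd.DataFrame(columns=["source", "distance"])
print(missing_required_columns(all_ct))
# ['additions', 'inferred_section_summary', 'cleaned_section_summary']
```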


iantei commented Apr 22, 2024

Looking into the server-side code:

Inside emission/analysis/userinput/matcher.py

def create_confirmed_entry(ts, tce, confirmed_key, input_key_list):
    # Copy the entry and fill in the new values
    confirmed_object_data = copy.copy(tce["data"])
    # del confirmed_object_dict["_id"]
    # confirmed_object_dict["metadata"]["key"] = confirmed_key
    if (confirmed_key == esda.CONFIRMED_TRIP_KEY):
        confirmed_object_data["expected_trip"] = tce.get_id()
        logging.debug("creating confimed entry from %s" % tce)
        cleaned_trip = ts.get_entry_from_id(esda.CLEANED_TRIP_KEY,
            tce["data"]["cleaned_trip"])
        confirmed_object_data['inferred_section_summary'] = get_section_summary(ts, cleaned_trip, "analysis/inferred_section")
        confirmed_object_data['cleaned_section_summary'] = get_section_summary(ts, cleaned_trip, "analysis/cleaned_section")
    elif (confirmed_key == esda.CONFIRMED_PLACE_KEY):
        confirmed_object_data["cleaned_place"] = tce.get_id()
    confirmed_object_data["user_input"] = \
        get_user_input_dict(ts, tce, input_key_list)
    confirmed_object_data["additions"] = \
        esdt.get_additions_for_timeline_entry_object(ts, tce)
    return ecwe.Entry.create_entry(tce['user_id'], confirmed_key, confirmed_object_data)

These are exactly the fields that are missing: "additions", "cleaned_section_summary" and "inferred_section_summary".
@shankari Could we have access to the server log so we can understand why this is happening?

@iantei iantei moved this from Issues being worked on to Questions for Shankari in OpenPATH Tasks Overview Apr 22, 2024
@Abby-Wheelis

When we do look at the server logs, I think it would help to look first for the log statements from 'get_section_summary'

@Abby-Wheelis

Yesterday evening/this morning I had a problem with the sensed notebook on my survey additions branch see here. This isn't the same error, as it happened later when making the 80% chart, but we should keep an eye out for that case once this error is resolved and when testing the stacked bar chart changes.

@Abby-Wheelis

Just checked on open-access, and the behavior there is different from the error I was working with. In my case, the number of trips (sensed) chart was ok, but the entire notebook errored out on the number of trips under 80% (sensed) chart. If it were the same error, I would expect to see the first chart; instead, all of the sensed charts are nulled out and none of them are showing.

@iantei
Copy link
Contributor Author

iantei commented Apr 25, 2024

Tried to load the dataset into Mongo, using the below script:

bash viz_scripts/docker/load_mongodump.sh <mongodump_file>

for the April 24 snapshot of the open-access dataset. The dataset is considerably large, ~4.4 GB.

With resources maxed out at 16 GB of container memory and 10 cores of container CPU, the entire dataset could not be loaded, resulting in the case below:

[Screenshots: terminal output and Docker resource profile chart, taken 2024-04-25]

Corresponding to the resource usage shown on the right, the script exited early as it reached the container memory allocation threshold.

Next: trying with the container CPU allocation increased to 16 cores.

@shankari

Please see the workaround for loading less data for testing the public dashboard


iantei commented Apr 25, 2024

Error Stack:

Is there any cleaned_section_summary which has NaN values?: True
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[6], line 1
----> 1 expanded_ct_sensed, file_suffix_sensed, quality_text_sensed, debug_df_sensed = scaffolding.load_viz_notebook_sensor_inference_data(year,
      2                                                                             month,
      3                                                                             program,
      4                                                                             include_test_users,
      5                                                                             sensed_algo_prefix)

File /usr/src/app/saved-notebooks/scaffolding.py:246, in load_viz_notebook_sensor_inference_data(year, month, program, include_test_users, sensed_algo_prefix)
    242 if len(expanded_ct) > 0:
    244     print(f"Is there any cleaned_section_summary which has NaN values?: {participant_ct_df['cleaned_section_summary'].isna().any()}")
--> 246     expanded_ct["primary_mode_non_other"] = participant_ct_df.cleaned_section_summary.apply(lambda md: max(md["distance"], key=md["distance"].get))
    247     expanded_ct.primary_mode_non_other.replace({"ON_FOOT": "WALKING"}, inplace=True)
    248     valid_sensed_modes = ["WALKING", "BICYCLING", "IN_VEHICLE", "AIR_OR_HSR", "UNKNOWN"]

File ~/miniconda-23.5.2/envs/emission/lib/python3.9/site-packages/pandas/core/series.py:4771, in Series.apply(self, func, convert_dtype, args, **kwargs)
   4661 def apply(
   4662     self,
   4663     func: AggFuncType,
   (...)
   4666     **kwargs,
   4667 ) -> DataFrame | Series:
   4668     """
   4669     Invoke function on values of Series.
   4670 
   (...)
   4769     dtype: float64
   4770     """
-> 4771     return SeriesApply(self, func, convert_dtype, args, kwargs).apply()

File ~/miniconda-23.5.2/envs/emission/lib/python3.9/site-packages/pandas/core/apply.py:1123, in SeriesApply.apply(self)
   1120     return self.apply_str()
   1122 # self.f is Callable
-> 1123 return self.apply_standard()

File ~/miniconda-23.5.2/envs/emission/lib/python3.9/site-packages/pandas/core/apply.py:1174, in SeriesApply.apply_standard(self)
   1172     else:
   1173         values = obj.astype(object)._values
-> 1174         mapped = lib.map_infer(
   1175             values,
   1176             f,
   1177             convert=self.convert_dtype,
   1178         )
   1180 if len(mapped) and isinstance(mapped[0], ABCSeries):
   1181     # GH#43986 Need to do list(mapped) in order to get treated as nested
   1182     #  See also GH#25959 regarding EA support
   1183     return obj._constructor_expanddim(list(mapped), index=obj.index)

File ~/miniconda-23.5.2/envs/emission/lib/python3.9/site-packages/pandas/_libs/lib.pyx:2924, in pandas._libs.lib.map_infer()

File /usr/src/app/saved-notebooks/scaffolding.py:246, in load_viz_notebook_sensor_inference_data.<locals>.<lambda>(md)
    242 if len(expanded_ct) > 0:
    244     print(f"Is there any cleaned_section_summary which has NaN values?: {participant_ct_df['cleaned_section_summary'].isna().any()}")
--> 246     expanded_ct["primary_mode_non_other"] = participant_ct_df.cleaned_section_summary.apply(lambda md: max(md["distance"], key=md["distance"].get))
    247     expanded_ct.primary_mode_non_other.replace({"ON_FOOT": "WALKING"}, inplace=True)
    248     valid_sensed_modes = ["WALKING", "BICYCLING", "IN_VEHICLE", "AIR_OR_HSR", "UNKNOWN"]

TypeError: 'float' object is not subscriptable

Added the below line of code for debugging:

print(f"Is there any cleaned_section_summary which has NaN values?: {participant_ct_df['cleaned_section_summary'].isna().any()}")

Result: Is there any cleaned_section_summary which has NaN values?: True

This shows there are NaN values in cleaned_section_summary; operating on them leads to the error above.
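The root cause can be reproduced in isolation: pandas stores a missing value in an object column as NaN, which is a plain float, so the lambda's subscript fails. A sketch (the shape of the summary dict is inferred from the code above):

```python
# A normal cleaned_section_summary value looks roughly like
# {"distance": {"WALKING": 120.0, "IN_VEHICLE": 900.0}}; a missing
# value surfaces as float('nan') instead.
md = float("nan")

try:
    max(md["distance"], key=md["distance"].get)  # what the lambda does
except TypeError as e:
    print(e)  # 'float' object is not subscriptable
```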

This seems identical to the issue described here: Issue 93


iantei commented Apr 25, 2024

Proposal for a solution:

expanded_ct = participant_ct_df.copy()
expanded_ct = expanded_ct.dropna(subset=['cleaned_section_summary'])
  1. Create a copy of participant_ct_df so that the original df is not modified.
  2. Drop the rows from the data frame wherever cleaned_section_summary is NaN.
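A sketch of that proposal on hypothetical data, showing that the NaN row is dropped while the original dataframe is left untouched:

```python
import numpy as np
import pandas as pd

# Hypothetical participant_ct_df: one valid summary, one NaN row.
participant_ct_df = pd.DataFrame({
    "cleaned_section_summary": [{"distance": {"WALKING": 120.0}}, np.nan],
    "end_fmt_time": ["2024-01-01T10:00:00", "2022-07-07T20:52:04"],
})

expanded_ct = participant_ct_df.copy()
expanded_ct = expanded_ct.dropna(subset=["cleaned_section_summary"])

print(len(expanded_ct))        # 1 -- NaN row dropped
print(len(participant_ct_df))  # 2 -- original unchanged
```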


shankari commented Apr 26, 2024

@iantei dropna will just paper over the real issue. The cleaned_section_summary should always exist.
You can:

  • see if there are patterns around missing section summaries - maybe the backwards compat code was not executed on this deployment
  • run the pipeline on the snapshot to see where it fails

@shankari shankari moved this from Questions for Shankari to Issues being worked on in OpenPATH Tasks Overview Apr 26, 2024

iantei commented May 4, 2024

There are 3878 records which have NaN for cleaned_section_summary.

Script to print, in sorted order, the end_fmt_time values of the rows where cleaned_section_summary is NaN:

    nan_rows = participant_ct_df[participant_ct_df['cleaned_section_summary'].isna()]
    print(len(nan_rows))
    end_fmt_times = []

    for index, row in nan_rows.iterrows():
        end_fmt_times.append(row['end_fmt_time'])
    end_fmt_times.sort()

    # Print the sorted list
    for timestamp in end_fmt_times:
        print(timestamp)
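The same filtering and sorting can be done without an explicit loop; a vectorized sketch on hypothetical data:

```python
import numpy as np
import pandas as pd

# Hypothetical data: two NaN summaries, one valid one.
participant_ct_df = pd.DataFrame({
    "cleaned_section_summary": [np.nan, {"distance": {}}, np.nan],
    "end_fmt_time": ["2023-08-04T15:10:53", "2024-01-01T00:00:00",
                     "2022-07-07T20:52:04"],
})

nan_mask = participant_ct_df["cleaned_section_summary"].isna()
sorted_times = participant_ct_df.loc[nan_mask, "end_fmt_time"].sort_values()

print(len(sorted_times))  # 2
print(sorted_times.tolist())
# ['2022-07-07T20:52:04', '2023-08-04T15:10:53']
```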

There is an observable pattern:

2022-07-07T20:52:04.129278-07:00
2022-07-07T21:46:43.999819-07:00
...
2023-08-04T15:10:53.000056-04:00
2023-08-04T15:15:06.000034-04:00
2023-08-04T17:28:29.755166-04:00
2023-08-04T18:45:00.000004-04:00

All these entries have an end_fmt_time prior to the deliverable of #92, which was delivered on 11th September 2023.
This indicates a strong likelihood of the possibility you mentioned, that the backwards compat code was not executed on this deployment.
