Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

The problem of the missing authors #77

Open
2 of 8 tasks
benlk opened this issue Jun 11, 2020 · 14 comments
Open
2 of 8 tasks

The problem of the missing authors #77

benlk opened this issue Jun 11, 2020 · 14 comments

Comments

@benlk
Copy link
Contributor

benlk commented Jun 11, 2020

The db-based approach:

  • get the new database from GS
  • try the export thing that GS made

  • extract all largo_custom_byline post meta from the postmeta table, in the form of an UPDATE statement
  • make a list of some affected post IDs
  • test applying that statement to a copy of the prod db
  • revise the statement as necessary
  • apply to a staging clone of prod via the Flywheel db management interfaces
  • apply to prod
@benlk
Copy link
Contributor Author

benlk commented Jun 17, 2020

try the export thing that GS made

authorship_fixer.php runs against the database and, using lookup tables in the .php file, does the following:

  1. for posts with [email protected] or [email protected] as author:
    • if those posts existed in Drupal as recorded by the "_fgd2wp_old_node_id" post meta key value for the migrated node ID: find the earliest Drupal node revision corresponding to that post's node ID, find the Drupal author UID of that node revision.
    • if there is a WP User corresponding by email address to the Drupal author, update the WP post_author to tha WP User
    • if no WP User is found, print an error message
    • if no Drupal author UID is found in the stored data tables, print an error message
    • if no Drupal post is found in the lookup tables with a node ID that matches the "_fgd2wp_old_node_id" post meta key value for the node ID, the post is assigned to the WP user for "[email protected]"
  2. for all posts in the database, If the post has "_fgd2wp_old_node_id" post meta:
    • if the post has corresponding data from the lookup table of Drupal author entity ID to author name, set that author name as the "largo_custom_byline" meta
    • if not, check a different lookup table to see if there's an author profile associated with the post, and if it exists, set the "largo_custom_byline" meta to the name of the author

@benlk
Copy link
Contributor Author

benlk commented Jun 17, 2020

One point that seems risky to me:

if no Drupal post is found in the lookup tables with a node ID that matches the "_fgd2wp_old_node_id" post meta key value for the node ID, the post is assigned to the WP user for "[email protected]"

However, in the test run of this script on their site, the printed log returned no logs that matched the message that the code would output in this case where a post is assigned to a generic user.

@benlk
Copy link
Contributor Author

benlk commented Jun 17, 2020

This script does not chunk by posts; instead it runs against all -1 posts in the same page load, and subsequently times out on my computer somewhere around post 1964, when starting from ~5600.

However, subsequent page loads process newly-updated records faster, resulting in the script eventually completing.

There's a large amount of this:

Notice: Trying to get property 'ID' of non-object in /Users/blk/sites/workdayminn/authorship_fixer.php on line 6326

5638 WELL! no valid drupal post revisions found... assigning to generic user

This is the message output with the "risky" case described at #77 (comment)

The non-object warning appears to be because that line is checking for $generic = get_user_by("email", "[email protected]"); which is returning no user on my local db

  • figure out whether the "risky" case happens when run against the prod db

@benlk
Copy link
Contributor Author

benlk commented Jun 17, 2020

Yes, the "risky" case happens on the prod db, and yes, the [email protected] user does not exist in prod. (In general, good that it doesn't exist. For this script, bad.)

@benlk
Copy link
Contributor Author

benlk commented Jun 17, 2020

Posts triggering the "risky" case:
6300, 6298, 5716, 5694, 5689, 5690, 5687, 5688, and more than 1000 others according to Firefox's search of the log output.

Post 6300 has user 2 as author, the dread "tsuperadmin".

This "risky" case is appearing in Step 1 of #77 (comment), where the script is still trying to match Drupal authors to WP Users.

After running the script, post 6300 has user 0 as author. This is bad: it results in a byline reading "By "

When the script finishes running, post 6300 still has author 0, but interestingly for 6300, it also has no largo_custom_byline metadata set. The script log reads:

6300 NID:30946 ERROR! no byline data found. cause: no $authref[0]["field_author_reference_target_id"]

$ wp post meta list 6300
| post_id | meta_key            | meta_value                                                                      |
+---------+---------------------+---------------------------------------------------------------------------------+
| 6300    | _fgd2wp_old_node_id | 30946                                                                           |
| 6300    | _thumbnail_id       | 985                                                                             |
| 6300    | wpcf-media          | https://wdmdev.gravityswitch.com/wp-content/uploads/2015/05/labor_education.gif |
| 6300    | _encloseme          | 1                                                                               |

So, for some posts, this script will result in posts changing authorship from tsuperadmin to no author at all!

Next steps:

  • check the Drupal DB to see if the corresponding node ID has any information at all, or whether the lack of information in this script accurately reflects the database
  • pick an author to be the fallback author, replacing [email protected]

@benlk
Copy link
Contributor Author

benlk commented Jun 17, 2020

The missing-author-info case is known and accepted.

The fallback author will be tsuperadmin, with the user display name changed to "Workday Minnesota".

To do:

  • refresh staging
  • update the script to use tsuperadmin's email address, [email protected]
  • update staging to have the correct byline for tsuperadmin
  • upload the script
  • run it to completion
  • save the output of the script as a text file
  • cat output.txt | grep 'field_author_reference_target_id' > log-of-posts-without-bylines.txt
  • cat log-of-posts-without-bylines.txt | cut -d ' ' -f 1 - | sort -g > post_id-without-bylines.txt to get the list of just post IDs
  • wc post_id-without-bylines.txt to get a count of those posts, and compare it with tsuperadmin's post count in the admin
  • pass the updated site to Workday for review, along with the list of post IDs and a short explanation of what that file is.

@benlk
Copy link
Contributor Author

benlk commented Jun 18, 2020

Sorry for the delay here! It looks like the script is looking for the wp-load.php file to be in the root. It is not in the root on our Flywheel Cloud Platform. However, there is a workaround for this. I added a wp-load.php file in the root which then links to the correct wp-load.php file location.

This should do the trick, but if you continue to have any issues, please let us know!

The support tech added this file on prod, not staging, but it's easy to copy from one to the other:

wp-load.php:

<?php require_once('.wordpress/wp-load.php'); ?>

However, Flywheel Staging does not print the raw output of the script as it works. It instead outputs a 504 Gateway Timeout error. We don't know how much progress is being made.

@benlk
Copy link
Contributor Author

benlk commented Jun 18, 2020

private repo for the revised authorship fixer scripts: https://github.com/INN/workday-author-fixer

@benlk
Copy link
Contributor Author

benlk commented Jun 18, 2020

a downside of timeout-based processing with Flywheels' Gateway timeout approach is that there's no way to get a log that covers all changes made.

In the final run of bad_author_fixer.php, which tried to find authors for posts by [email protected], the "admin" user:

In the final run of authorship_fixer.php:

  • The final output of the script is, in the end, the 7663 posts without a _fgd2wp_old_node_id post meta that indicates there was a corresponding Drupal post. This list covers both new posts from after the migration and posts that, due to their age, were likely migrated. The full list of post IDs is this:
    authorship_fixer.no-_fgd2wp_old_node_id.post_ids.txt
  • That list incorporates in its entirety the list of posts with no WP author matching the saved Drupal author. The list of posts lacking a Drupal author is a subset of the posts that lack the post meta indicating they were Drupal posts, which does not make sense because the author lookup should only have happened if the post had a _fgd2wp_old_node_id

In conclusion:

@benlk
Copy link
Contributor Author

benlk commented Jul 1, 2020

Production run notes:

In the final run of bad_author_fixer.php, which tried to find authors for posts by [email protected], the "admin" user:

In the final run of authorship_fixer.php:

In conclusion:

@MirandaEcho
Copy link

Thanks @benlk! How did this compare to the staging run?

@benlk
Copy link
Contributor Author

benlk commented Jul 1, 2020

Slightly fewer posts with a known Drupal author but no corresponding WordPress author; I think they might've set some WP authors manually on prod.

Same number of posts by the "Admin" user.

@MirandaEcho
Copy link

Thanks! Let's rename the author as you suggested, as we did with SFPP.

@benlk
Copy link
Contributor Author

benlk commented Jul 1, 2020

Author display name updated to "Workday Minnesota Staff"

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants