
Fixed: Refactor scripts #91, Add logging #85, and Clean code and comments #86 by putting them in a shared library/module #96

Merged
merged 46 commits
Apr 2, 2024

Commits (46)
cffadb3
Latest working
IamMQaisar Mar 14, 2024
8eaeabd
Merge branch 'main' of https://github.com/IamMQaisar/quantifying
IamMQaisar Mar 14, 2024
beda247
Latest working
IamMQaisar Mar 14, 2024
b4f4641
Merge branch 'creativecommons:main' into main
IamMQaisar Mar 15, 2024
a882cf4
Google_Scratcher Refined
IamMQaisar Mar 16, 2024
1e1bf8e
deviantart_scratcher refined
IamMQaisar Mar 16, 2024
f77ea63
internetarchive_scratcher refined
IamMQaisar Mar 16, 2024
f6e4163
Single legal_tool_paths.txt in the root dir
IamMQaisar Mar 16, 2024
ec3a4e9
Deleted Same Files legal_tool_paths.txt
IamMQaisar Mar 16, 2024
aa38124
Pre-Commit Check Auto-modifications
IamMQaisar Mar 16, 2024
9575002
deviantart added Logging
IamMQaisar Mar 16, 2024
63b86dd
Update deviantart_scratcher.py
IamMQaisar Mar 16, 2024
3916bf1
deviantart_scratch logging added
IamMQaisar Mar 16, 2024
b4d082e
flicker end set
IamMQaisar Mar 16, 2024
5eedca7
flicker end set
IamMQaisar Mar 16, 2024
1f495bf
internetarchive logging added
IamMQaisar Mar 16, 2024
c1e564a
google_scratcher added logging
IamMQaisar Mar 16, 2024
4c8407a
root_path removed
IamMQaisar Mar 16, 2024
3ac4081
root_path removed
IamMQaisar Mar 16, 2024
b5f526b
root_path removed from internetarchive
IamMQaisar Mar 17, 2024
8f418cc
root_path removed from internetarchive
IamMQaisar Mar 17, 2024
0f750dc
WIP Shared Module Added
IamMQaisar Mar 17, 2024
38e8601
restructuring
IamMQaisar Mar 17, 2024
719d15f
Shared Lib Updated in dev. yout. quant.
IamMQaisar Mar 17, 2024
c2e926a
recommented dev file
IamMQaisar Mar 17, 2024
b3ce247
google scr. shared lib.
IamMQaisar Mar 17, 2024
70a29df
added module origin in logger
IamMQaisar Mar 17, 2024
8c4b9b5
precommit restructure
IamMQaisar Mar 17, 2024
f73d86a
All Files Loggers add with shared and clean code
IamMQaisar Mar 17, 2024
d94fc8f
removed unnecessary import paths or DATETIME
IamMQaisar Mar 17, 2024
33555f9
Code Organized
IamMQaisar Mar 17, 2024
dc6a1e1
Readme.md Updated
IamMQaisar Mar 18, 2024
6870f7a
readme updated
IamMQaisar Mar 18, 2024
8aef40a
Structure Document photos.json
JoinAsadullah Mar 18, 2024
36b8a52
Merge pull request #2 from JoinAsadullah/main
IamMQaisar Mar 18, 2024
e869f01
wikipedia_scratcher.py updated
IamMQaisar Mar 27, 2024
2d703c8
Merge branch 'creativecommons:main' into main
IamMQaisar Mar 27, 2024
162c67c
Merge branch 'main' into shared_module
IamMQaisar Mar 27, 2024
84d3e24
conflicts resolved
IamMQaisar Mar 27, 2024
7ad009a
Merge branch 'main' into shared_module
IamMQaisar Mar 29, 2024
dff96d3
precommit modified
IamMQaisar Mar 29, 2024
7a55164
Merge branch 'main' into shared_module
IamMQaisar Apr 1, 2024
48d5104
CONFLITS REMOVED
IamMQaisar Apr 1, 2024
0b0c3b7
Merge pull request #7 from creativecommons/main
IamMQaisar Apr 1, 2024
a3dd718
required changes applied
IamMQaisar Apr 2, 2024
b3d5e21
clean-up path handling (use os.path.join and remove unnecessary f-str…
TimidRobot Apr 2, 2024
4 changes: 2 additions & 2 deletions README.md
@@ -133,15 +133,15 @@ directories to check:
- [ppypa/pipenv][pipenv]: _Python Development Workflow for Humans._
- [pre-commit][pre-commit]: _A framework for managing and maintaining
multi-language pre-commit hooks._
- [Logging][logging]: _Built-in Python logging module to implement a flexible logging system across shared modules._
- [Logging][logging]: _Utilize the built-in Python logging module to implement a flexible logging system from a shared module._

[ccospyguide]: https://opensource.creativecommons.org/contributing-code/python-guidelines/
[black]: https://github.com/psf/black
[flake8]: https://github.com/PyCQA/flake8
[isort]: https://pycqa.github.io/isort/
[pipenv]: https://github.com/pypa/pipenv
[pre-commit]: https://pre-commit.com/
[logging]: https://docs.python.org/3/howto/logging.html
[logging]: https://docs.python.org/3/library/logging.html


### GitHub Actions
Empty file added __init__.py
Empty file.
75 changes: 30 additions & 45 deletions analyze/data_analysis.py
@@ -3,11 +3,9 @@
"""

# Standard library
import logging
import os.path
import os
import re
import sys
import traceback
import warnings

# Third-party
@@ -16,37 +14,21 @@
import pandas as pd
import plotly.express as px
import seaborn as sns
from wordcloud import STOPWORDS, WordCloud # noqa: E402

warnings.filterwarnings("ignore")
# First-party/Local
import quantify

# Third-party
from wordcloud import STOPWORDS, WordCloud # noqa: E402
# Warning suppression /!\ Caution /!\
warnings.filterwarnings("ignore")

# Set the current working directory
PATH_WORK_DIR = os.path.dirname(os.path.abspath(__file__))
# Setup PATH_WORK_DIR, and LOGGER using quantify.setup()
sys.path.append(os.path.abspath(os.path.join(os.path.dirname(__file__), "..")))
_, PATH_WORK_DIR, _, _, LOGGER = quantify.setup(__file__)

# Set the current working directory
CWD = os.path.dirname(os.path.abspath(__file__))

# Set up the logger
LOG = logging.getLogger(__name__)
LOG.setLevel(logging.INFO)

# Define both the handler and the formatter
handler = logging.StreamHandler()
formatter = logging.Formatter(
"%(asctime)s - %(levelname)s - %(name)s - %(message)s"
)

# Add formatter to the handler
handler.setFormatter(formatter)

# Add handler to the logger
LOG.addHandler(handler)

# Log the start of the script execution
LOG.info("Script execution started.")


def tags_frequency(csv_path, column_names):
"""
@@ -59,7 +41,7 @@ def tags_frequency(csv_path, column_names):
Example: ["tags", "description"]

"""
LOG.info("Generating word cloud based on tags.")
LOGGER.info("Generating word cloud based on tags.")

df = pd.read_csv(csv_path)
# Process each column containing tags
@@ -79,7 +61,7 @@
and str(row) != ""
and str(row) != "nan"
):
LOG.debug(f"Processing row: {row}")
LOGGER.debug(f"Processing row: {row}")
if "ChineseinUS.org" in str(row):
row = "ChineseinUS"
list2 += re.split(r"\s|(?<!\d)[,.](?!\d)", str(row))
@@ -168,7 +150,7 @@ def time_trend_helper(df):
Returns:
- DataFrame: DataFrame with counts of entries per year.
"""
LOG.info("Extracting year-wise count of entries.")
LOGGER.info("Extracting year-wise count of entries.")

year_list = []
for date_row in df["dates"][0:]:
@@ -196,7 +178,7 @@ def time_trend(csv_path):
Args:
- csv_path (str): Path to the CSV file.
"""
LOG.info("Generating time trend line graph.")
LOGGER.info("Generating time trend line graph.")

df = pd.read_csv(csv_path)
count_df = time_trend_helper(df)
@@ -239,7 +221,7 @@ def time_trend_compile_helper(yearly_count):
Returns:
- DataFrame: Filtered yearly count data.
"""
LOG.info("Filtering yearly trend data.")
LOGGER.info("Filtering yearly trend data.")

Years = np.arange(2018, 2023)
yearly_count["year"] = list(yearly_count.index)
@@ -249,7 +231,7 @@
int(yearly_count["year"][num]) >= 2018
):
counts.append(yearly_count["Counts"][num])
LOG.info(f"{counts}")
LOGGER.info(f"{counts}")
final_yearly_count = pd.DataFrame(
list(zip(Years, counts)), columns=["Years", "Yearly_counts"]
)
@@ -260,7 +242,7 @@ def time_trend_compile():
"""
Compile yearly trends for different licenses and plot them.
"""
LOG.info("Compiling yearly trends for different licenses.")
LOGGER.info("Compiling yearly trends for different licenses.")

license1 = pd.read_csv("../flickr/dataset/cleaned_license1.csv")
license2 = pd.read_csv("../flickr/dataset/cleaned_license2.csv")
@@ -319,7 +301,7 @@ def time_trend_compile():
yearly_count6 = time_trend_compile_helper(yearly_count6)
yearly_count9 = time_trend_compile_helper(yearly_count9)
yearly_count10 = time_trend_compile_helper(yearly_count10)
LOG.info(f"{yearly_count1}")
LOGGER.info(f"{yearly_count1}")

# Plot yearly trend for all licenses
plt.plot(
@@ -408,20 +390,22 @@ def view_compare_helper(df):
Returns:
- int: Maximum views.
"""
LOG.info("Calculating maximum views of pictures under a license.")
LOGGER.info("Calculating maximum views of pictures under a license.")

highest_view = int(max(df["views"]))
df = df.sort_values("views", ascending=False)
LOG.info(f"DataFrame sorted by views in descending order: {df}")
LOG.info(f"Maximum views found: {highest_view}")
LOGGER.info(f"DataFrame sorted by views in descending order: {df}")
LOGGER.info(f"Maximum views found: {highest_view}")
return highest_view


def view_compare():
"""
Compare maximum views of pictures under different licenses.
"""
LOG.info("Comparing maximum views of pictures under different licenses.")
LOGGER.info(
"Comparing maximum views of pictures under different licenses."
)

license1 = pd.read_csv(
os.path.join(PATH_WORK_DIR, "../flickr/dataset/cleaned_license1.csv")
@@ -461,7 +445,7 @@ def view_compare():
maxs = []
for lic in licenses:
maxs.append(view_compare_helper(lic))
LOG.info(f"{maxs}")
LOGGER.info(f"{maxs}")
# Create DataFrame to store license and their maximum views
temp_data = pd.DataFrame()
temp_data["Licenses"] = [
@@ -517,7 +501,9 @@ def total_usage():
"""
Generate a bar plot showing the total usage of different licenses.
"""
LOG.info("Generating bar plot showing total usage of different licenses.")
LOGGER.info(
"Generating bar plot showing total usage of different licenses."
)

# Reads the license total file as the input dataset
df = pd.read_csv(
@@ -538,15 +524,14 @@ def main():


if __name__ == "__main__":
# Exception Handling
try:
main()
except SystemExit as e:
LOG.error(f"System exit with code: {e.code}")
LOGGER.error("System exit with code: %d", e.code)
sys.exit(e.code)
except KeyboardInterrupt:
LOG.info("(130) Halted via KeyboardInterrupt.")
LOGGER.info("Halted via KeyboardInterrupt.")
sys.exit(130)
except Exception:
LOG.error(f"(1) Unhandled exception: {traceback.format_exc()}")
LOGGER.exception("Unhandled exception:")
sys.exit(1)
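The rewritten `__main__` guard above standardizes the scripts' exit-code convention: propagate `SystemExit` codes, exit 130 on Ctrl-C (the conventional 128 + SIGINT), and exit 1 for anything else, logging via `LOGGER.exception` instead of formatting the traceback by hand. That pattern, extracted into a standalone runner for illustration (the `run()` wrapper is this sketch's invention, not part of the PR):

```python
# Sketch of the exit-code convention used by the updated __main__ guards.
import logging
import sys

LOGGER = logging.getLogger(__name__)


def run(main):
    try:
        main()
    except SystemExit as e:
        LOGGER.error("System exit with code: %d", e.code)
        sys.exit(e.code)
    except KeyboardInterrupt:
        LOGGER.info("Halted via KeyboardInterrupt.")
        sys.exit(130)  # 128 + SIGINT(2), the shell convention for Ctrl-C
    except Exception:
        # logging.exception appends the traceback automatically,
        # replacing the old traceback.format_exc() f-strings.
        LOGGER.exception("Unhandled exception:")
        sys.exit(1)
```

Using `LOGGER.exception` here also removes the need to import `traceback` at all, which is why that import disappears from each script's diff.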
65 changes: 34 additions & 31 deletions deviantart/deviantart_scratcher.py
@@ -7,7 +7,6 @@
import logging
import os
import sys
import traceback

# Third-party
import pandas as pd
@@ -16,26 +15,32 @@
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

sys.path.append(".")
# First-party/Local
import quantify # noqa: E402
import quantify

PATH_REPO_ROOT, PATH_WORK_DIR, PATH_DOTENV, DATETIME_TODAY = quantify.setup(
__file__
# Setup paths, Date and LOGGER using quantify.setup()
sys.path.append(os.path.abspath(os.path.join(os.path.dirname(__file__), "..")))
PATH_REPO_ROOT, PATH_WORK_DIR, PATH_DOTENV, DATETIME_TODAY, LOGGER = (
quantify.setup(__file__)
)

# Load environment variables
load_dotenv(PATH_DOTENV)

# Retrieve API keys
API_KEYS = os.getenv("GOOGLE_API_KEYS").split(",")

# Global Variable for API_KEYS indexing
API_KEYS_IND = 0

# Gets API_KEYS and PSE_KEY from .env file
API_KEYS = os.getenv("GOOGLE_API_KEYS").split(",")
PSE_KEY = os.getenv("PSE_KEY")

# Set up file path for CSV report
DATA_WRITE_FILE = os.path.join(
PATH_WORK_DIR,
f"data_deviantart_"
f"{DATETIME_TODAY.year}_{DATETIME_TODAY.month}_{DATETIME_TODAY.day}.csv",
DATA_WRITE_FILE = (
f"{PATH_WORK_DIR}"
f"/data_deviantart_"
f"{DATETIME_TODAY.year}_{DATETIME_TODAY.month}_{DATETIME_TODAY.day}.csv"
)
# Retrieve Programmable Search Engine key from environment variables
PSE_KEY = os.getenv("PSE_KEY")

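The `DATA_WRITE_FILE` expression above builds the dated CSV path by f-string concatenation with a hand-placed `/`; the PR's final commit ("clean-up path handling (use os.path.join and remove unnecessary f-str…") moves such code to `os.path.join`, which handles separators portably. A small comparison of the two styles (the directory value is hypothetical, for illustration only):

```python
# Comparing f-string path concatenation (as in the hunk above) with
# os.path.join (as in the follow-up cleanup commit).
import datetime
import os

PATH_WORK_DIR = "/repo/deviantart"  # hypothetical work dir for illustration
today = datetime.date(2024, 4, 2)

# f-string concatenation with a hand-placed separator:
via_fstring = (
    f"{PATH_WORK_DIR}"
    f"/data_deviantart_{today.year}_{today.month}_{today.day}.csv"
)

# os.path.join inserts the separator itself:
via_join = os.path.join(
    PATH_WORK_DIR,
    f"data_deviantart_{today.year}_{today.month}_{today.day}.csv",
)

assert via_fstring == via_join  # identical on POSIX paths
```

On POSIX the two produce the same string, but `os.path.join` also normalizes correctly when a component already ends in a separator and on platforms with different separators, which is the motivation for the cleanup commit.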
# Set up the logger
LOG = logging.getLogger(__name__)
@@ -62,14 +67,15 @@ def get_license_list():
Provides the list of license from 2018's record of Creative Commons.

Returns:
- np.array: An array containing all license types that should be
searched via Programmable Search Engine.
- np.array:
An np array containing all license types that should be searched
via Programmable Search Engine (PSE).
"""
LOG.info("Retrieving list of license from Creative Commons' record.")

# Read license data from file
cc_license_data = pd.read_csv(
os.path.join(PATH_WORK_DIR, "legal-tool-paths.txt"), header=None
f"{PATH_REPO_ROOT}/legal-tool-paths.txt", header=None
)
# Define regex pattern to extract license types
license_pattern = r"((?:[^/]+/){2}(?:[^/]+)).*"
@@ -104,7 +110,7 @@ def get_request_url(license):
)
except Exception as e:
if isinstance(e, IndexError):
LOG.exception("Depleted all API Keys provided")
LOGGER.error("Depleted all API Keys provided")
else:
raise e

@@ -146,16 +152,14 @@ def get_response_elems(license):
# If quota limit exceeded, switch to the next API key
global API_KEYS_IND
API_KEYS_IND += 1
LOG.exception("Changing API KEYS due to depletion of quota")
LOGGER.error("Changing API KEYS due to depletion of quota")
return get_response_elems(license)
else:
raise e

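The hunk above keeps the scraper's API-key rotation: when a request exhausts the current key's quota, the global index `API_KEYS_IND` advances and the call retries with the next key from `GOOGLE_API_KEYS`, and an `IndexError` past the last key means all keys are depleted. A self-contained sketch of that rotation idea, with a hypothetical `fetch()` and `QuotaExceeded` standing in for the real Custom Search request and its quota error:

```python
# Illustrative sketch of the API-key rotation pattern; fetch() and
# QuotaExceeded are stand-ins, not the project's real request code.
class QuotaExceeded(Exception):
    pass


def fetch(api_key):
    """Stand-in for the real API request."""
    if api_key == "depleted-key":
        raise QuotaExceeded(api_key)
    return {"key_used": api_key}


def get_with_rotation(api_keys, index=0):
    """Try each key in turn, advancing past keys whose quota is spent."""
    try:
        return fetch(api_keys[index])
    except QuotaExceeded:
        # Quota spent on this key: move to the next one and retry.
        return get_with_rotation(api_keys, index + 1)
    except IndexError:
        # Ran off the end of the key list: nothing left to try.
        raise RuntimeError("Depleted all API keys provided")
```

The real scripts do the same thing with a module-level index so the rotation state survives across calls; the recursion here mirrors the retry-with-next-key structure of `get_response_elems` above.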

def set_up_data_file():
"""Writes the header row to the file to contain DeviantArt data."""
LOG.info("Setting up data file by writing the header row.")

# Writes the header row to the file to contain DeviantArt data.
header_title = "LICENSE TYPE,Document Count"
with open(DATA_WRITE_FILE, "w") as f:
f.write(f"{header_title}\n")
@@ -164,9 +168,11 @@ def record_license_data(license_type):
def record_license_data(license_type):
"""Writes the row for LICENSE_TYPE to the file to contain DeviantArt data.
Args:
- license_type(str): A string representing the type of license.
It's a segment of the URL towards the license description. If not provided,
it defaults to None, indicating no assumption about the license type.
- license_type:
A string representing the type of license, and should be a segment
of its URL towards the license description. Alternatively, the
default None value stands for having no assumption about license
type.
"""
LOG.info(
"Writing the row for license type %s to contain DeviantArt data",
@@ -187,11 +193,8 @@ def record_all_licenses():
list and writes this data into the DATA_WRITE_FILE, as specified by the
constant.
"""
LOG.info("Recording data for all available license types.")

# Get the list of license types
# Gets the list of license types and record data for each license type
license_list = get_license_list()
# Record data for each license types
for license_type in license_list:
record_license_data(license_type)

@@ -206,11 +209,11 @@ def main():
try:
main()
except SystemExit as e:
LOG.error(f"System exit with code: {e.code}")
LOGGER.error("System exit with code: %d", e.code)
sys.exit(e.code)
except KeyboardInterrupt:
LOG.info("(130) Halted via KeyboardInterrupt.")
LOGGER.info("Halted via KeyboardInterrupt.")
sys.exit(130)
except Exception:
LOG.error(f"(1) Unhandled exception: {traceback.format_exc()}")
LOGGER.exception("Unhandled exception:")
sys.exit(1)