I am building an e-commerce Python project based on the DrivenData Cookiecutter Data Science template. In my opinion the template is great because it separates the data folders from the code. You can treat your work as a directed flow of transformations, starting with immutable raw data and moving through to the final results.
Initially I tried pkg_resources, but I dropped it because its syntax was long and hard for me to understand.
So I wrote my own helper code to locate the project root folder, as shown below:
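from pathlib import Path

# shorter version
ROOT = Path(__file__).parents[3]

# longer version
def find_repo_root():
    """Return the root folder of the repository.

    The current file is assumed to sit in a folder three levels below the root.
    """
    levels_up = 3
    return Path(__file__).parents[levels_up]

ROOT = find_repo_root()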
DATA_FOLDER = ROOT / 'data'
UNPACK_RAR_EXE = str(ROOT / 'bin' / 'UnRAR.exe')
XL_PATH = str(ROOT / 'output' / 'kep.xlsx')
This is similar to what you have done with DATA_DIR. The only difference is that I have hardcoded the location of my helper file relative to the project root. If the helper file moves, I need to change the code.
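One way to soften that hardcoding is to search upward from the helper file for a marker that exists only at the repository root. The sketch below is an assumption, not what my project does: the function name `find_root_by_marker` and the marker names (`.git`, `setup.py`) are hypothetical, so pick whatever reliably identifies your own root.

```python
from pathlib import Path


def find_root_by_marker(start, markers=(".git", "setup.py")):
    """Walk upward from *start* and return the first folder holding a marker.

    Hypothetical helper: the marker names are assumptions, adjust them
    to whatever exists only in your repository root.
    """
    start = Path(start).resolve()
    # If *start* is not a directory, begin the search in its parent folder.
    folder = start if start.is_dir() else start.parent
    for candidate in (folder, *folder.parents):
        if any((candidate / marker).exists() for marker in markers):
            return candidate
    raise FileNotFoundError(f"No marker {markers} found above {start}")
```

With something like `ROOT = find_root_by_marker(__file__)` the helper file can move around inside the repository without the code needing an update. The trade-off is that the result now depends on the marker being present.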
The other thing the helpers do is give access to specific data in the raw, interim and processed folders.
The simplest approach is a function that returns the complete path for a file name in a given folder, for example:
def interim(filename):
    """Return path for *filename* in 'data/interim' folder."""
    return str(ROOT / 'data' / 'interim' / filename)
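If you need the same kind of helper for several folders, the idea can be folded into one function. This is a minimal sketch, assuming the three stage folders named above. `data_path`, `STAGES` and the placeholder value of `ROOT` are illustrative names, not part of my project.

```python
from pathlib import Path

# Placeholder so the snippet runs on its own; in a real project ROOT
# would come from the root-finding helper.
ROOT = Path("/path/to/repo")

# Assumed stage folders under 'data'.
STAGES = ("raw", "interim", "processed")


def data_path(stage, filename, root=ROOT):
    """Return the path for *filename* inside 'data/<stage>' (hypothetical helper)."""
    if stage not in STAGES:
        raise ValueError(f"Unknown stage {stage!r}, expected one of {STAGES}")
    return str(Path(root) / "data" / stage / filename)
```

Rejecting unknown stage names early means a typo in the folder name fails at the call site instead of surfacing later as a missing file.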
In your project, fixing the root directory is easy: do it once in setting.py and abstract access to your data there.
data_sample() mixes file access with data transformation, which is not good practice. It also relies on a global name. I suggest you structure the code as follows:
# keep this in setting.py
import os

def processed(filename):
    return os.path.join(DATA_DIR, filename)
import random

import pandas as pd

# this works on a dataframe - your argument is a dataframe,
# and you return a dataframe
def transform_sample(df: pd.DataFrame, code=None) -> pd.DataFrame:
    # FIXME: what is `code`?
    # pick a random code only when none was given
    if code is None:
        code = random.choice(df.code.unique())
    return df[df.code == code].sort_values('Date')
# make a small but elegant pipeline of data transformation
file_path = processed('my_data')
df0 = pd.read_parquet(file_path)
df = transform_sample(df0)