
I have been researching how to structure a data science project so that it is reusable and follows the basic philosophy of the Python language. Below is the layout of my project:

├── README.md
├── data
│   ├── processed          <-- data files
│   └── raw
├── notebooks
│   └── notebook_1
├── setup.py
├── settings.py            <-- settings file
└── src
    ├── __init__.py
    └── data
        └── get_data.py    <-- script

The data is loaded from 'data/processed'. I want to load data from data/processed in other scripts and in my Jupyter notebooks, which are located in ./notebooks.

import random

import pandas as pd

def data_sample(code=None):
    df = pd.read_parquet('../../data/processed/my_data')
    if not code:
        code = random.choice(df.code.unique())
    df = df[df.code == code].sort_values('Date')
    return df

This function only works when it is run from a location where the relative path to data/processed resolves correctly. So I decided to create a settings.py in which I declare the path as shown below:

from os.path import join, dirname

DATA_DIR = join(dirname(__file__), 'data', 'processed')

Finally, I can write the function as shown below:

import os
import random

import pandas as pd

from my_project import settings

def data_sample(code=None):
    file_path = os.path.join(settings.DATA_DIR, 'my_data')
    df = pd.read_parquet(file_path)
    if not code:
        code = random.choice(df.code.unique())
    df = df[df.code == code].sort_values('Date')
    return df

Now my question is, am I using the settings.py correctly? 

Is there any other way to do it?

settings.DATA_DIR looks ugly; can I change it?

1 Answer


I am building an e-commerce Python project based on the Cookiecutter Data Science template from DrivenData. In my opinion the template is great because it separates the data folders from the code. You can treat your work as a directed flow of transformations, starting with immutable raw data and going to final results.

Initially, I started using pkg_resources, but I later stopped because its long syntax was difficult for me to understand.

So I wrote my own code to locate the root folder, as shown below:

from pathlib import Path

# shorter version
ROOT = Path(__file__).parents[3]

# longer version
def find_repo_root():
    """Returns root folder for repository.

    Current file is assumed to be:
        <repo_root>/src/kep/helper/<this file>.py
    """
    levels_up = 3
    return Path(__file__).parents[levels_up]

ROOT = find_repo_root()
DATA_FOLDER = ROOT / 'data'
UNPACK_RAR_EXE = str(ROOT / 'bin' / 'UnRAR.exe')
XL_PATH = str(ROOT / 'output' / 'kep.xlsx')

This is similar to what you have done with DATA_DIR. The only difference is that I have hardcoded the location of my helper file relative to the project root. If the file location changes, I need to change this code.
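One way to avoid hardcoding the number of levels is to walk up from the current file until a marker file is found. The following is a minimal sketch and not part of my project: the function name find_root_by_marker and the choice of setup.py as the marker are assumptions made for illustration, based on the layout in the question.

```python
from pathlib import Path


def find_root_by_marker(start, marker="setup.py"):
    """Return the closest folder at or above *start* that contains *marker*.

    Hypothetical helper: `marker` is assumed to be a file that exists
    only in the repository root (setup.py in the layout from the question).
    """
    start = Path(start).resolve()
    # Check the starting folder itself (if it is one), then each parent in turn.
    candidates = [start] if start.is_dir() else []
    candidates.extend(start.parents)
    for folder in candidates:
        if (folder / marker).exists():
            return folder
    raise FileNotFoundError(f"No {marker!r} found at or above {start}")
```

In a helper module this would be called as ROOT = find_root_by_marker(__file__). The trade-off is that the marker file must exist and must not also appear in a subfolder between the helper and the root.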

Another option is to expose accessors for specific data in the raw, interim and processed folders.

The simplest approach is a function that takes a file name and returns the complete path to that file in the folder, for example:

def interim(filename):
    """Return path for *filename* in 'data/interim folder'."""
    return str(ROOT / 'data' / 'interim' / filename)
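The same idea can be generalized to all three folders with one function. This is only a sketch, assuming the raw/interim/processed layout mentioned above; the name data_path and the explicit root argument are hypothetical, chosen so the function does not rely on a module-level ROOT.

```python
from pathlib import Path

# Folder names taken from the layout discussed above.
STAGES = ("raw", "interim", "processed")


def data_path(root, stage, filename):
    """Return the path to *filename* in '<root>/data/<stage>' as a string."""
    if stage not in STAGES:
        raise ValueError(f"Unknown stage {stage!r}, expected one of {STAGES}")
    return str(Path(root) / "data" / stage / filename)
```

Rejecting unknown stage names early means a typo such as 'procesed' fails with a clear message instead of a missing-file error later in the pipeline.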

In your project, the root directory is easy to find because settings.py sits in it, so you can abstract access to your data in settings.py.

data_sample() mixes file access with data transformation, which is not a good design. It also depends on a global name. I suggest you code as follows:

# keep this in settings.py (next to DATA_DIR)
import os

def processed(filename):
    return os.path.join(DATA_DIR, filename)

# this works on a dataframe - your argument is a dataframe,
# and you return a dataframe
import random

import pandas as pd

def transform_sample(df: pd.DataFrame, code=None) -> pd.DataFrame:
    # FIXME: what is `code`?
    if not code:
        code = random.choice(df.code.unique())
    return df[df.code == code].sort_values('Date')

# make a small but elegant pipeline of data transformation
file_path = processed('my_data')
df0 = pd.read_parquet(file_path)
df = transform_sample(df0)
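Because transform_sample() takes a dataframe rather than a file path, it can be tried on a small in-memory frame without touching data/processed. Below is a sketch with made-up sample values; the column names code and Date come from the question, and everything else (the value column, the sample rows) is illustrative.

```python
import random

import pandas as pd


def transform_sample(df: pd.DataFrame, code=None) -> pd.DataFrame:
    """Keep the rows for one `code` (random if not given), sorted by Date."""
    if not code:
        code = random.choice(df.code.unique())
    return df[df.code == code].sort_values('Date')


# Made-up data, for illustration only.
sample = pd.DataFrame(
    {
        "code": ["A", "B", "A"],
        "Date": pd.to_datetime(["2020-01-02", "2020-01-01", "2020-01-01"]),
        "value": [2, 3, 1],
    }
)

# Passing `code` explicitly makes the result deterministic.
result = transform_sample(sample, code="A")
```

Here result contains only the two rows with code "A", ordered by Date. Reading the parquet file stays a separate step, so the transformation can be checked on its own.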
