0 votes
2 views
in Data Science by (18.4k points)

I have been researching how to structure a data science project so that it can be reused, following the fundamental philosophy of the Python programming language. Below is the layout of my project:

├── README.md
├── data
│   ├── processed          <-- data files
│   └── raw
├── notebooks
│   └── notebook_1
├── setup.py
├── settings.py            <-- settings file
└── src
    ├── __init__.py
    └── data
        └── get_data.py    <-- script

'data/processed' is where the data is loaded from. Now I want to load data from data/processed in other scripts and in my Jupyter notebook, which lives in notebooks/.

import random

import pandas as pd

def data_sample(code=None):
    df = pd.read_parquet('../../data/processed/my_data')
    if not code:
        code = random.choice(df.code.unique())
    df = df[df.code == code].sort_values('Date')
    return df

This function only works when it is run from a location where the relative path to data/processed resolves correctly. So I decided to build a settings.py where I declare the data directory:

from os.path import join, dirname

DATA_DIR = join(dirname(__file__), 'data', 'processed')

Finally, I can write:

import os
import random

import pandas as pd

from my_project import settings

def data_sample(code=None):
    file_path = os.path.join(settings.DATA_DIR, 'my_data')
    df = pd.read_parquet(file_path)
    if not code:
        code = random.choice(df.code.unique())
    df = df[df.code == code].sort_values('Date')
    return df

Now my question is: am I using settings.py correctly?

Is there any other way to do it?

settings.DATA_DIR looks ugly; can I change it?

1 Answer

0 votes
by (36.8k points)
edited by

I am building an e-commerce Python project based on the data-driven Cookiecutter template. In my opinion the template is great because it separates the data folders from the code: you work on a transformation flow, starting with immutable raw data and moving toward final results.

Initially I used pkg_resources, but I stopped because its long syntax was difficult for me to understand.

So I started coding the root folder lookup on my own, as shown below:

from pathlib import Path

# shorter version
ROOT = Path(__file__).parents[3]

# longer version
def find_repo_root():
    """Return the root folder of the repository.

    The current file is assumed to be located at:
        <repo_root>/src/kep/helper/<this file>.py
    """
    levels_up = 3
    return Path(__file__).parents[levels_up]

ROOT = find_repo_root()

DATA_FOLDER = ROOT / 'data'
UNPACK_RAR_EXE = str(ROOT / 'bin' / 'UnRAR.exe')
XL_PATH = str(ROOT / 'output' / 'kep.xlsx')

This is similar to what you have done with DATA_DIR. The only difference is that I hardcode the location of my helper file relative to the project root, so if the file ever moves I have to change the code.

The other option is to add access functions for the raw, interim and processed data folders.

It is the simplest method: each function joins the root, the folder and the file name into a complete path, for example:

def interim(filename):
    """Return path for *filename* in the data/interim folder."""
    return str(ROOT / 'data' / 'interim' / filename)

In your project you can do the same: fix the root directory once in settings.py and put the data-access helpers there.

data_sample() mixes data access with data transformation, which is not a good idea, and it also relies on a global name. I suggest you code as follows:

import os
import random

import pandas as pd

# keep this in settings.py
def processed(filename):
    return os.path.join(DATA_DIR, filename)

# this works on a dataframe - your argument is a dataframe,
# and you return a dataframe
def transform_sample(df: pd.DataFrame, code=None) -> pd.DataFrame:
    # FIXME: what is `code`?
    if not code:
        code = random.choice(df.code.unique())
    return df[df.code == code].sort_values('Date')

# make a small but elegant pipeline of data transformation
file_path = processed('my_data')
df0 = pd.read_parquet(file_path)
df = transform_sample(df0)
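One payoff of separating access from transformation is that transform_sample() can be exercised on an in-memory frame, without touching the filesystem. A quick sketch, where the tiny frame and its values are made up for illustration (column names match the example above):

```python
import random

import pandas as pd

def transform_sample(df: pd.DataFrame, code=None) -> pd.DataFrame:
    """Keep only the rows for one *code*, sorted by Date."""
    if not code:
        code = random.choice(df.code.unique())
    return df[df.code == code].sort_values('Date')

# a tiny frame standing in for data/processed/my_data
df0 = pd.DataFrame({
    'code': ['A', 'A', 'B'],
    'Date': ['2020-02-01', '2020-01-01', '2020-03-01'],
})
out = transform_sample(df0, code='A')
# out contains only code 'A', ordered by Date
```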
