Data can be obtained in several ways—for example by downloading it from a server, by querying a database, or by connecting to a web API. Sometimes, the data comes in a compressed form or in a binary format such as Microsoft Excel. You may need to do one or more of the following:
- Download data from the Internet
- Decompress files
- Calling Web APIs
- Querying Relational Databases
- Convert Microsoft Excel spreadsheet into usable data
The main problem in data acquisition is keeping track of its origin that means where each piece of data comes from and whether it is still up-to-date. It is important to accurately track the origin, since data often needs to be re-acquired in the future to run updated experiments.
Read these Top Trending Data Science Interview Q’s now that helps you grab high-paying jobs!
Re-acquisition can happen either when researchers want to test alternate hypotheses or when the original data sources get updated. Data management is a related problem where programmers must allocate names to data files that they create or download and then organize those files into directories. When they create or download the new versions of those files, they must make sure to assign proper filenames to all versions and keep track of their differences.
Second problem in data acquisition is storage. Sometimes there is large number of data which cannot fit on a single hard drive, so it should be stored on remote servers. Anecdotes and empirical studies specify that a considerable amount of data analysis is still done on desktop machines with data sets that fit on modern hard drives i.e. less than a terabyte.
Looking for Top Jobs in Data Science ? This blog post gives you all the information you need!
Downloading from the Internet
The Internet provides the largest amount of data. This data is available in various forms using a variety of protocols. The command-line tool cURL can be used to download data from the Internet.
When you access a URL, which means uniform resource locator, through your browser, the data which is downloaded can be interpreted. For example, an HTML file is rendered as a website, a PDF file may be automatically opened by a viewer etc. However, when cURL is used to access a URL, the data is downloaded as is, and is printed to standard output.
The easiest invocation of curl is to simply specify a URL as a command-line argument.
For example, to download Mark Twain’s Adventures of Huckleberry Finn from Project Gutenberg, we can run the following command:
$ curl -s http://www.gutenberg.org/cache/epub/76/pg76.txt | head -n 10
By default curl outputs a progress meter that displays the download rate and the expected time of completion.
Querying Relational Databases
You can also get the data from relational databases by performing a SELECT query on them and sql2csv also supports update, insert, and delete queries. The output of sql2csv is in CSV format.
Sql2csv is a commnd line tool which is the part of Csvkit suite because it influences the Python SQLAlchemy package, you only have to use one tool to execute queries on many different databases including SQLite, MySQL, Oracle, Sysbase etc.
Learn more about Data Science in this insightful blog now!
If the original data is very large or is a collection of many files, then compression is used. Data sets which have many repeated data like the words in a text file or the keys in a JSON file are fitted for compression.
Common file extensions of compressed archives are: .tar.gz, .zip, and .rar. To decompress these files, you can use the command-line tools tar, unzip, and unrar respectively.
Calling Web APIs
To get data there is also another method where the data can come from the internet is through a web API where API stands for application programming interface. Web APIs are not meant to be presented in a nice layout, such as websites. Data which is in structured form has the advantage that the data can be easily processed by other tools, like as jq.
Converting Microsoft Excel Spreadsheets
Microsoft Excel offers an intuitive way to work with small data sets and do calculations on them. As a output a lot of data is embedded into Microsoft
Excel spreadsheets. These spreadsheets are depend on the extension of the Filename which is stored in either a proprietary binary format (.xls) or as a collection of compressed XML files (.xlsx). In these both cases the data is not readily usable by most command-line tools.
Command-line tool in2csv is able to change Microsoft Excel spreadsheets to CSV files. CSV stands for comma-separated values or character-separated values. Working with CSV can be difficult because it not have a formal specification.