Importing CSV: Meeting the Challenge Head-on

Article Friday, July 8 2022

Whether you are a machine learning engineer, data scientist, or business analyst (or, honestly, anyone else at this point), you’ve probably received, used, imported, or opened a CSV (comma-separated values) file. A well-known, widely used text-based format, CSV stores tabular data records in a form that is easily transferable and human readable. The simplicity of using commas as a field separator has made parsing straightforward and enabled the proliferation of CSV readers and integrations across applications and platforms.

This simplicity is now sometimes seen as a hindrance in the era of big data and data-driven decision making. If your data pipeline is completely in house, it can be easier to migrate to other data stores, such as a SQL database. Standardizing on one data format can reduce cost and improve efficiency. CSV and spreadsheet formats are great for transferring tabular data, but they require transformation to be usable by many of these database solutions. The moment third-party companies with their own data flows are introduced, particularly if they bring in further third parties such as customers and contractors, CSV may just be the best possible option to bridge the data translation gap. In this case, maintaining a CSV importer is essential to maintaining compatibility and increasing ease of use. However, a host of problems can emerge when importing a CSV, and your company needs to be prepared to deal with them:

Data Validation and Schema Verification

Data validation and schema verification are problems common to nearly any data source, but they are especially prevalent in CSVs. The simple nature of CSVs enables column names and data fields but not much else. The format offers no metadata to identify the data type of each column, and there is no way to indicate whether a field is required or optional, which column names are equivalent, or how the data fields are structured.
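Because the file itself carries no type information, an importer has to guess. As a minimal sketch of that guesswork (the candidate types, their order, and the single date format are assumptions made for illustration, not a complete heuristic), each column can be tested against progressively looser parsers:

```python
import csv
import io
from datetime import datetime

# Candidate parsers, tried from most to least specific (an assumed ordering).
CANDIDATES = [
    ("integer", int),
    ("float", float),
    ("date", lambda v: datetime.strptime(v, "%Y-%m-%d")),  # ISO dates only
]

def infer_type(values):
    """Guess one column's type from its non-empty values."""
    non_empty = [v.strip() for v in values if v and v.strip()]
    if not non_empty:
        return "empty"
    for name, parse in CANDIDATES:
        try:
            for value in non_empty:
                parse(value)
        except ValueError:
            continue  # at least one value did not fit; try the next type
        return name
    return "string"

def infer_columns(text):
    """Map each column name in a CSV string to its guessed type."""
    rows = list(csv.DictReader(io.StringIO(text)))
    if not rows:
        return {}
    return {name: infer_type([row[name] for row in rows]) for name in rows[0]}
```

A guess like this is only as good as the sample it sees: one stray value in a later file can turn an "integer" column into a "string" column, which is why inferred types usually still need confirmation.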

Example of Messy Data: Medium

To handle these limitations within CSV, many employ a strict schema approach, establishing rigid format requirements upon imported CSV files to match the target system’s schema exactly. This may work great with tightly integrated dataflows, where in-house team members know the rigid requirements and routinely work with them. However, when you introduce data from other organizations or from individual customers, this approach puts an undue burden of effort on the third party to rework their data to prepare it for ingestion.
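To make the strict approach concrete, here is a minimal sketch of what such a check might look like. The schema, its column names, and the error messages are hypothetical; a real target system would define its own.

```python
import csv
import io

# Hypothetical target schema: column name -> (converter, required?)
SCHEMA = {
    "id": (int, True),
    "email": (str, True),
    "signup_date": (str, False),
}

def validate_strict(text):
    """Return a list of error strings; an empty list means the file passed."""
    reader = csv.DictReader(io.StringIO(text))
    header = reader.fieldnames or []
    # Strict mode: the header must match the schema exactly, in order.
    if header != list(SCHEMA):
        return [f"header mismatch: expected {list(SCHEMA)}, got {header}"]
    errors = []
    for row_no, row in enumerate(reader, start=1):
        for name, (convert, required) in SCHEMA.items():
            value = (row[name] or "").strip()  # short rows yield None
            if not value:
                if required:
                    errors.append(f"row {row_no}: {name} is required")
                continue
            try:
                convert(value)
            except ValueError:
                errors.append(
                    f"row {row_no}: {name}={value!r} is not a valid {convert.__name__}"
                )
    return errors
```

Note how unforgiving this is: a file whose header reads "ID,Email" is rejected outright, even though a human would immediately see what was meant.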

If data from outside your organization is a requirement, and you do not want the outside party to spend hours recreating thousands of data records to meet your strict schema requirements, you need to provide fuzzy matching, automatic value parsing, and handling for a number of complex edge cases in order to consolidate the data into your standardized format. Existing libraries can help companies build such a tool, but they are often rigid in their implementation and won’t be able to adapt to all the mistakes and edge cases that will emerge in production. Open source libraries such as messytables, Datalib and CSVkit use heuristics that can detect a limited number of data types, but to make these useful in production, companies must extend their functionality to deal with domain-specific issues. All of this results in a significant expenditure of time and human resources.
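As an illustration of the fuzzy matching mentioned above, the sketch below maps incoming column headers onto a target schema using Python's standard difflib module. The target field names, the alias table, and the 0.8 similarity cutoff are all assumptions chosen for the example.

```python
import difflib
import re

# Hypothetical target fields and known aliases for them.
TARGET_FIELDS = ["first_name", "last_name", "email", "postal_code"]
ALIASES = {"zip": "postal_code", "zip_code": "postal_code", "e_mail": "email"}

def normalize(header):
    """Lowercase a header and collapse punctuation and spaces to underscores."""
    return re.sub(r"[^a-z0-9]+", "_", header.strip().lower()).strip("_")

def match_headers(headers, cutoff=0.8):
    """Map each incoming header to a target field, or None if nothing is close."""
    mapping = {}
    for header in headers:
        key = normalize(header)
        if key in ALIASES:
            mapping[header] = ALIASES[key]
            continue
        close = difflib.get_close_matches(key, TARGET_FIELDS, n=1, cutoff=cutoff)
        mapping[header] = close[0] if close else None
    return mapping
```

With these assumptions, headers such as "First Name", "E-mail", and "ZIP Code" resolve to first_name, email, and postal_code, while an unrelated header like "Notes" maps to None and can be surfaced to the user for a manual decision. Tuning the cutoff is a trade-off: too low and unrelated columns get matched, too high and simple typos are rejected.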

Data Transformations

A second common issue that often crops up when dealing with CSVs is data transformations.

Example Data Transformation: Safe Software

Though programs like Excel can create and handle data transformations, they often require conversion to a proprietary data format and are typically not portable across systems. In other words, while you gain a little bit of functionality for tabular data, you lose some of the key benefits of CSVs. Data transformations are often necessary though, especially when translating data between applications, between organizations, or even between different organizational teams. For instance, the organization providing the file might have a single field for address, while the target has multiple fields for sub-components such as street address, PO box, city, state, and ZIP or postal code. Splitting the data in this manner allows records to be grouped by ZIP or postal code, city, state, and so on. Sales figures might need to be aggregated or split according to a variable, so key stakeholders can make decisions for the future. Each of these transformations has to be developed individually and in response to feature requests. In doing so, the CSV importer becomes an internal product that needs to be responsive to internal changes and requires both maintenance and updates.
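The address example can be sketched as follows. This is a deliberately narrow illustration that assumes US-style addresses written as "street, city, ST 12345"; the pattern and the output field names are hypothetical, and real-world addresses vary far more than one regular expression can cover.

```python
import re

# Assumed input shape: "<street>, <city>, <2-letter state> <ZIP or ZIP+4>"
ADDRESS_RE = re.compile(
    r"^\s*(?P<street>[^,]+),\s*(?P<city>[^,]+),\s*"
    r"(?P<state>[A-Za-z]{2})\s+(?P<zip>\d{5}(?:-\d{4})?)\s*$"
)

def split_address(address):
    """Split a single-field address into components, or return None."""
    match = ADDRESS_RE.match(address)
    if match is None:
        return None  # leave unparseable rows for manual review
    parts = {name: value.strip() for name, value in match.groupdict().items()}
    parts["state"] = parts["state"].upper()
    return parts
```

Returning None instead of guessing keeps bad splits out of the target system, at the cost of a review queue for every address that does not fit the assumed shape. That trade-off is exactly the kind of decision that turns each transformation into its own small feature.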

Limited Standardization

Despite efforts to standardize over the years, CSV variants abound. Not only are numerous character encodings still in use (UTF-8, UTF-16, and Windows-1252, for example), but there are also slight variants that some still consider to be CSV files, notably .TSV (Tab-Separated Values) and semicolon- and pipe-delimited files. Strictly speaking, only commas are supported as delimiters by the CSV specification (RFC 4180), but these alternative formats are common and are typically lumped in with CSV due to their similar logical structure. It is important, then, when building a CSV importer, to communicate the standards you will recognize, as well as any variants, including spreadsheet files. Otherwise, you will never know what curve balls customers or other teams and organizations might attempt to send across home plate. If the supported formats are not explicitly defined, silent bugs may emerge. Also, the more variations your library supports, the more likely it is to be poorly optimized and to perform badly for specific implementations of a CSV file.
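One way to cope with these variants is to detect the encoding and delimiter before parsing. The sketch below uses only Python's standard library; the list of accepted delimiters, the UTF-8-then-Latin-1 decoding order, and the fallback to a comma are assumptions for the example rather than a recommended policy.

```python
import csv
import io

ACCEPTED_DELIMITERS = ",;\t|"  # the variants this importer agrees to recognize

def decode_bytes(raw):
    """Decode as UTF-8 (dropping any BOM), falling back to Latin-1."""
    try:
        return raw.decode("utf-8-sig")
    except UnicodeDecodeError:
        return raw.decode("latin-1")  # never fails, but may mis-read characters

def detect_delimiter(text):
    """Guess the delimiter from the first few KB of text; default to a comma."""
    try:
        return csv.Sniffer().sniff(text[:4096], delimiters=ACCEPTED_DELIMITERS).delimiter
    except csv.Error:
        return ","

def read_rows(raw):
    """Decode raw bytes and parse them into a list of rows."""
    text = decode_bytes(raw)
    delimiter = detect_delimiter(text)
    return list(csv.reader(io.StringIO(text), delimiter=delimiter))
```

Detection of this kind is heuristic, so it can guess wrong on small or irregular files. Restricting the candidates to an explicit list, as done here, is one way to keep the supported variants documented and the failure modes predictable.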

Why Build Your Own?

With the complexities inherent in building a CSV importer in-house, it can quickly become an unhealthy resource and time sink for any organization, particularly smaller ones. A good CSV importer is in reality a separate product or service that you must maintain and support, on top of the original development time and resources. This time sink can cross multiple levels, involving sales and support calls as well as development.

Rather than devoting a significant portion of your resources to developing a function or service that is not core to your solution, why not leverage a white-label CSV importer that can integrate fully with your web page or application? Flatirons Fuse offers just that: a plug-and-play CSV importer that allows you to keep your focus on core deliverables while ensuring the best possible experience for your customers when onboarding data. Flatirons Fuse handles CSV data validation on both the front end and the back end, can manage both basic and custom data transformations according to your individual needs, and supports numerous file formats, including TSV, CSV, XLS, and XLSX. Best of all, it is free to try, and free forever if you do not exceed 10,000 records a month.

Take a look at their website, and give their product a no-risk, all-reward whirl.

Flatirons Fuse Application: Flatirons Development

Ashvin Nihalani

San Francisco, CA

Education: B. Eng, EECS, University of California

Originally from Texas. Graduated from Berkeley with a B.Eng in EECS. Interested in basically anything, well, anything interesting. More recently focused on Machine Learning, Blockchain, and Embedded Systems.