Sunday, July 7, 2013

A CSV-to-CSV converter using Symfony Console application skeleton

A common problem when dealing with received CSV files is that often its structure and/or contents need to be converted in order to be fed to some existing application. A real life example is when we have is a list of items to be imported into an online shop (sent by a provider using a predefined CSV structure), and we discover that the import module of our online shop needs another predefined CSV structure and/or contents. We could solve the problem by adapting our import software, but then we will be facing the same problem again when the next provider sends its predefined format...

So instead we will write a generic CSV-to-CSV converter using the our Symfony Console Skeleton to solve the problem.

First things first

  • Clone the project: git clone https://github.com/magabriel/Csv2Csv and you are all set to start customizing the code to suit your needs.

This post will walk you through the implementation of the CLI skeleton for a real case scenario and how the CSV-to-CSV conversion is performed.

Layouts

Our example input file has the following look:

ISBN,Title,Author,Price EUR,Release date
9123456789012,The title of the book,Author name,12.34,2013-02-01
9123456789013,"The title of the book, with a comma",Another author name,34.56,2012-01-02

It is an standard CSV file with comma as field delimiter and double quote as optional field enclosure. First line contains column headers (field names).

Our goal is to convert its contents to this:

product_sku;product_name;product_price;product_category;product_year
9123456789012;The title of the book;13.57;Computers;2013
9123456789013;The title of the book, with a comma;38.02;Computers;2012

Just another "standard" CSV file, but this time with semicolon as delimiter (and let's assume double quote as field enclosure). Please note that it has a field not found in the input file (book_category) with value "Computers" (let's pretend that all books in the input file have this implicit categorization) and that the book_price has the input price incremented by a 10%.

And, to make things even more interesting, let's pretend our input file is encoded in ISO-8859-1 and we want the output file to be UTF-8.

Converters

Let's start creating a BaseConverter class that will provide the basic functionality:

    //myproject/src/MyNamespace/Converter/BaseConverter.php
    <?php

    namespace MyNamespace\Converter;

    abstract class BaseConverter
    {
        protected $inputFileName;
        protected $outputFileName;
        protected $results = array();

        /**
         * The input file name
         * @param string $inputFileName
         */
        public function setInputFileName($inputFileName)
        {
            $this->inputFileName = $inputFileName;
        }

        /**
         * The output file name
         * @param string $outputFileName
         */
        public function setOutputFileName($outputFileName)
        {
            $this->outputFileName = $outputFileName;
        }

        /**
         * Retrieve the conversion results
         * @return array
         */
        public function getResults()
        {
            return $this->results;
        }
    }

The class will have methods getInputFileDefinition() and getOutputFileDefinition() that must be overriden to provide a valid definition for both files:

    //myproject/src/MyNamespace/Converter/BaseConverter.php
    <?php

    namespace MyNamespace\Converter;

    abstract class BaseConverter
    {
      // ...
      
      /**
       * Override to provide the right definition
       * @return array
       */
      protected function getInputFileDefinition()
      {
          $definition = array();

          $definition['fields'] = array(
          // 'field_name1', 'field_nameN'
          );

          $definition['enclosure'] = '"';
          $definition['delimiter'] = ',';

          $definition['encoding'] = 'UTF-8';
          $definition['has-headers'] = true;

          return $definition;
      }      
    
      /**
       * Override to provide the right definition
       * @return array
       */
      protected function getOutputFileDefinition()
      {    
        // same as above
      }
      
      // ...
      
    }

$definition['fields'] contains all the fields names, while the rest of $definition array contains the other values needed to completely define the file format.

Now we will create a convert() method:

    //myproject/src/MyNamespace/Converter/BaseConverter.php
    <?php

    namespace MyNamespace\Converter;

    abstract class BaseConverter
    {
      // ...
      
      /**
       * Perform the conversion
       * @return boolean Success
       */
      public function convert()
      {
          // get definitions
          $inputDef = $this->getInputFileDefinition();
          $outputDef = $this->getOutputFileDefinition();

          // init input file
          $fInput = new \SplFileObject($this->inputFileName);
          $fInput->setFlags(\SplFileObject::READ_CSV);
          $fInput->setCsvControl($inputDef['delimiter'], $inputDef['enclosure']);

          // init output file
          @unlink($this->outputFileName);
          $fOutput = new \SplFileObject($this->outputFileName, 'w');
          $fOutput->setCsvControl($outputDef['delimiter'], $outputDef['enclosure']);

          // write headers if set
          if ($outputDef['has-headers']) {
              $outRecord = $outputDef['fields'];
              $fOutput->fputcsv($outRecord);
          }

          // process input file as records
          $count = 0;
          $headers = false;
          foreach ($fInput as $record) {

              // empty record == EOF
              if (!$record[0]) {
                  break;
              }

              // skip headers line
              if ($inputDef['has-headers'] && !$headers) {
                  $headers = true;
                  continue;
              }

              // complete missing input fields with blanks
              if (count($inputDef['fields']) > count($record)) {
                  for ($i = count($record); $i < count($inputDef['fields']); $i++) {
                      array_push($record, '');
                  }
              }

              // call record converter and write results
              $inRecord = array_combine($inputDef['fields'], $record);
              $inRecord = $this->convertEncoding(
                  $inRecord,
                  $inputDef['encoding'],
                  $outputDef['encoding']
              );
              $outRecord = $this->convertRecord($inRecord);
              
              // ensure the output file record has the fields in the right order
              $outRecord = $this->reorderRecordFields($outputDef['fields'], $outRecord);
              
              $fOutput->fputcsv($outRecord);

              $count++;
          }

          $this->results['records.count'] = $count;

          return true;
      }


      /**
       * Reorder record fields to match definition
       * @param array $recordFields
       * @param array $record
       */
      protected function reorderRecordFields(array $recordFields, array $record)
      {
          $recordReordered = array();
      
          foreach ($recordFields as $field) {
              if (isset($record[$field])) {
                  $recordReordered[$field] = $record[$field];
              } else {
                  $recordReordered[$field] = '';
              }
          }
      
          return $recordReordered;
      }
            
      // ...
      
    }

The last bit is the convertEnconding() method, needed to perform the character encoding conversions:

    //myproject/src/MyNamespace/Converter/BaseConverter.php
    <?php

    namespace MyNamespace\Converter;

    abstract class BaseConverter
    {
      // ...
      
      /**
       * Convert the record to the right encoding
       * @param array $record
       * @param string $fromEncoding
       * @param string $toEncoding
       */
      protected function convertEncoding(array $record, $fromEncoding, $toEncoding)
      {
          array_walk(
              $record,
              function (&$field, $key) use ($fromEncoding, $toEncoding) {
                  $field = mb_convert_encoding($field, $toEncoding, $fromEncoding);
              }
          );

          return $record;
      }
    
      // ...
      
    }

MyInputFileConverter

Now we are ready to define our converter class by extending BaseConverter:

    //myproject/src/MyNamespace/Converter/MyInputFileConverter.php
    <?php

    namespace MyNamespace\Converter;

    /**
     * Converter for my input file
     */
    class MyInputFileConverter extends BaseConverter
    {
        protected $markUp = 0.1; // use a setter in a real life scenario

        protected function getInputFileDefinition()
        {
            $definition = array();

            $definition['fields'] = array(
                    'ISBN',
                    'Title',
                    'Author'
                    'Price EUR',
                    'Release date'
            );

            $definition['enclosure'] = '"';
            $definition['delimiter'] = ',';

            $definition['encoding'] = 'ISO-8859-1';
            $definition['has-headers'] = true;

            return $definition;
        }

        protected function getOutputFileDefinition()
        {
            $definition = array();

            $definition['fields'] = array(
                    'product_sku',
                    'product_name',
                    'product_price',
                    'product_category',
                    'product_year'
            );

            $definition['enclosure'] = '"';
            $definition['delimiter'] = ';';

            $definition['encoding'] = 'UTF-8';
            $definition['has-headers'] = true;

            return $definition;
        }
        
        // ...
    }

We also need to define the convertRecord() method that will convert one record at a time:

    //myproject/src/MyNamespace/Converter/MyInputFileConverter.php
    <?php

    namespace MyNamespace\Converter;

    /**
     * Converter for my input file
     */
    class MyInputFileConverter extends BaseConverter
    {
        // ....
        
        protected function convertRecord(array $record)
        {
            $out = array();

            $out['product_sku'] = $record['ISBN'];
            $out['product_name'] = $record['Title'];
            $out['product_category'] = 'Computers';            
            $out['product_price'] = $record['Price EUR'] * (1 + $this->$markUp);
            $date = new \DateTime($record['Release date']);
            $out['product_year'] = $date->format('Y');

            return $out;
        }
    }

As you can see, input values are retrieved by field name from $record argument, and assigned to an $out array also by field name. The convert() base class method will take care of creating both associative arrays and populating / extracting its values, and perform the conversion between encodings.

All in all, it makes a very clear and readable conversion process. And it's testable, too!!!

Command

So we jut need to define a console command to call our converter. Let's use also a base class that provides the basic functionality:

    //myproject/src/MyNamespace/Console/Command/BaseConverter.php
    <?php

    namespace MyNamespace\Console\Command;

    use Symfony\Component\Console\Command\Command;
    use Symfony\Component\Console\Input\InputArgument;
    use Symfony\Component\Console\Input\InputInterface;
    use Symfony\Component\Console\Input\InputOption;
    use Symfony\Component\Console\Output\OutputInterface;

    /**
     * Base for converter commands
     */
    abstract class BaseConvertCommand extends Command
    {
        protected function configure()
        {
            // add common arguments and options
            $this->addArgument('inputFile', InputArgument::REQUIRED, 'Input CSV file');
            $this->addArgument('outputFile', InputArgument::REQUIRED, 'Output CSV file');
            $this->addOption('force', 'f', InputOption::VALUE_NONE, 'Force execution');
        }

        protected function execute(InputInterface $input, OutputInterface $output)
        {
            $inputFile = $input->getArgument('inputFile');
            $outputFile = $input->getArgument('outputFile');

            // check input and output files
            if (!file_exists($inputFile)) {
                $output->writeln(sprintf('Error: Input file "%s" not found', $inputFile));
                return 1;
            }

            if (file_exists($outputFile) && !$input->getOption('force')) {
                $dialog = $this->getHelperSet()->get('dialog');
                if (!$dialog
                    ->askConfirmation(
                        $output,
                        sprintf(
                            'Output file "%s" already exists. Overwrite? [y/N] ',
                            $outputFile
                        ),
                        false
                    )
                        ) {
                    $output->writeln('Halted');
                    return 1;
                }
            }
        }
        
        // ...
    }

And now we can extend it:

    //myproject/src/MyNamespace/Console/Command/MyInputFileConverterCommand.php
    <?php

    namespace MyNamespace\Console\Command;

    use Symfony\Component\Console\Command\Command;
    use Symfony\Component\Console\Input\InputArgument;
    use Symfony\Component\Console\Input\InputInterface;
    use Symfony\Component\Console\Input\InputOption;
    use Symfony\Component\Console\Output\OutputInterface;

    /**
     * Converts the CSV books file
     */
    class MyInputFileConverterCommand extends BaseConvertCommand
    {
        protected function configure()
        {
            parent::configure();

            $this->setName('convert:myinputfile')
                 ->setDescription('Converts myinputfile format to myoutputfile format')
                 ->setHelp('Performs the conversion.');
        }

        protected function execute(InputInterface $input, OutputInterface $output)
        {
            parent::execute($input, $output);

            $converter = new MyInputFileConverter();

            $converter->setInputFileName($input->getArgument('inputFile'));
            $converter->setOutputFileName($input->getArgument('outputFile'));
            $ret = $converter->convert();

            if (!$ret) {
                $output->writeln('Finished with error');
                return 1;
            }

            $results = $converter->getResults();
            $output->writeln(
                sprintf(
                    'Completed. %s records written (not counting headings)',
                    $results['records.count']
                )
            );
        }
    }

Execution

We can run the conversion using

php run.php convert:myinputfile inputfile.csv outputfile.csv

Wrap up

We now have a converter that its easily extensible and maintainable, and that can be compiled into a phar for easy distribution thanks to our CLI skeleton. And it was fun to develop, too!!

As mentioned above, the converter class is easily testable (just instantiate it, no need to mock anything), so its probably a good idea to add unit tests to your conversion process.

Edited 8-8-2013

Added the BaseConverter::reorderRecordFields() medthod to handle the case when the output fields are not fille up in the same order that defined.

Also uploaded as a fully working GitHub project: https://github.com/magabriel/Csv2Csv.

1 comment:

  1. Sadly, although PDF information are very helpful and handy, it can't be edited. As soon as the file is opened, the one factor that an individual can do is to learn it. That is one of many the reason why these information dramatically lower in total measurement. onlineconvertfree.com

    ReplyDelete