CerebralCortex Data Importer

Submodules

cerebralcortex.data_importer.ingest module

import_dir(cc_config: dict, input_data_dir: str, user_id: str = None, data_file_extension: list = [], allowed_filename_pattern: str = None, allowed_streamname_pattern: str = None, ignore_streamname_pattern: str = None, batch_size: int = None, compression: str = None, header: int = None, metadata: cerebralcortex.core.metadata_manager.stream.metadata.Metadata = None, metadata_parser: Callable = None, data_parser: Callable = None, gen_report: bool = False)[source]

Scan a data directory, parse its files, and ingest the data into the CerebralCortex backend.

Parameters:
  • cc_config (str) – cerebralcortex config directory
  • input_data_dir (str) – data directory path
  • user_id (str) – user id. Currently, import_dir only supports parsing a directory associated with a single user
  • data_file_extension (list[str]) – (optional) provide file extensions (e.g., .doc) that must be ignored
  • allowed_filename_pattern (str) – (optional) regex of files that must be processed.
  • allowed_streamname_pattern (str) – (optional) regex; only stream names matching this pattern will be processed
  • ignore_streamname_pattern (str) – (optional) regex of stream names to be ignored during the ingestion process
  • batch_size (int) – (optional) setting this parameter turns on Spark parallelism; batch size is the number of files each worker will process
  • compression (str) – (optional) compression type, if the CSV files are compressed
  • header (int) – (optional) row number that should be used for column names. None means the file does not contain a header row
  • metadata (Metadata) – (optional) if provided, the same metadata will be used for all data files. metadata and metadata_parser are mutually exclusive.
  • metadata_parser (python function) – a parser that can parse JSON files and return a valid Metadata object. metadata_parser and metadata are mutually exclusive.
  • data_parser (python function) – a parser that can parse each line of a data file. import_dir reads data files as lists of lines; data_parser is applied to every row.
  • gen_report (bool) – setting this to True produces a console report of the total failures that occurred during the ingestion process.

Notes

Each CSV file should be accompanied by a metadata file. The data file and its metadata file must have the same base name (for example, data.csv and data.json). Metadata files must be JSON files.

Todo

Provide sample metadata file URL
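
Example (illustrative sketch). The config path, data directory, user id, and compression value below are placeholders, not values shipped with CerebralCortex; the package-level import follows the Module contents section below.

    from cerebralcortex.data_importer import import_dir

    import_dir(
        cc_config="/path/to/cc/config/dir/",   # CerebralCortex config directory (placeholder path)
        input_data_dir="/path/to/data/dir/",   # directory containing data.csv / data.json pairs
        user_id="demo-user-id",                # placeholder user id
        data_file_extension=[".doc", ".tmp"],  # extensions to ignore
        batch_size=10,                         # turn on Spark parallelism, 10 files per worker
        compression="gzip",                    # only if the CSV files are gzip-compressed (assumed value)
        header=0,                              # first row contains column names
        gen_report=True,                       # print a failure summary at the end
    )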

import_file(cc_config: dict, user_id: str, file_path: str, allowed_streamname_pattern: str = None, ignore_streamname_pattern: str = None, compression: str = None, header: int = None, metadata: cerebralcortex.core.metadata_manager.stream.metadata.Metadata = None, metadata_parser: Callable = None, data_parser: Callable = None)[source]

Import a single file and its metadata into cc-storage.

Parameters:
  • cc_config (str) – cerebralcortex config directory
  • user_id (str) – user id. Currently, data can only be imported for a single user at a time
  • file_path (str) – file path
  • allowed_streamname_pattern (str) – (optional) regex; only stream names matching this pattern will be processed
  • ignore_streamname_pattern (str) – (optional) regex of stream names to be ignored during the ingestion process
  • compression (str) – (optional) compression type, if the CSV file is compressed
  • header (int) – (optional) row number that should be used for column names. None means the file does not contain a header row
  • metadata (Metadata) – (optional) if provided, this metadata will be used for the data file. metadata and metadata_parser are mutually exclusive.
  • metadata_parser (python function) – a parser that can parse JSON files and return a valid Metadata object. metadata_parser and metadata are mutually exclusive.
  • data_parser (python function) – a parser that can parse each line of a data file. Data files are read as lists of lines; data_parser is applied to every row.

Notes

The CSV file should be accompanied by a metadata file. The data file and its metadata file must have the same base name (for example, data.csv and data.json). Metadata files must be JSON files.

Returns:

False in case of an error

Return type:

bool
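
Example (illustrative sketch). The file path, config path, and user id are placeholders; a matching .json metadata file is assumed to sit next to the CSV file.

    from cerebralcortex.data_importer import import_file

    result = import_file(
        cc_config="/path/to/cc/config/dir/",        # placeholder config directory
        user_id="demo-user-id",                     # placeholder user id
        file_path="/path/to/data/dir/accel.csv",    # accel.json assumed alongside
        header=0,                                   # first row contains column names
    )
    if result is False:   # the docs only specify False in case of an error
        print("import_file reported an error for this file")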

print_stats_table(ingestion_stats: dict)[source]

Print data import statistics as a table.

Parameters:
  • ingestion_stats (dict) – basic import statistics, e.g. {"fault_type": [], "total_faults": []}
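
Example (illustrative sketch). The fault labels and counts below are hypothetical; the dictionary shape follows the {"fault_type": [], "total_faults": []} example above.

    from cerebralcortex.data_importer.ingest import print_stats_table

    ingestion_stats = {
        "fault_type": ["missing-metadata-file", "corrupt-csv-row"],   # hypothetical fault labels
        "total_faults": [3, 12],                                      # hypothetical counts
    }
    print_stats_table(ingestion_stats)
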
save_data(df: object, cc_config: dict, user_id: str, stream_name: str)[source]

Save a dataframe to the CC storage system.

Parameters:
  • df (pandas.DataFrame) – dataframe to be saved
  • cc_config (str) – cerebralcortex config directory
  • user_id (str) – user id
  • stream_name (str) – name of the stream
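
Example (illustrative sketch). The dataframe columns, config path, user id, and stream name are placeholders; the column layout CerebralCortex expects is not specified on this page.

    import pandas as pd

    from cerebralcortex.data_importer.ingest import save_data

    df = pd.DataFrame({
        "timestamp": pd.to_datetime(["2019-01-01 00:00:00", "2019-01-01 00:00:01"]),
        "value": [0.1, 0.2],
    })

    save_data(
        df=df,
        cc_config="/path/to/cc/config/dir/",   # placeholder config directory
        user_id="demo-user-id",                # placeholder user id
        stream_name="demo--accelerometer",     # placeholder stream name
    )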

cerebralcortex.data_importer.main module

Module contents

import_file(cc_config: dict, user_id: str, file_path: str, allowed_streamname_pattern: str = None, ignore_streamname_pattern: str = None, compression: str = None, header: int = None, metadata: cerebralcortex.core.metadata_manager.stream.metadata.Metadata = None, metadata_parser: Callable = None, data_parser: Callable = None)[source]

Import a single file and its metadata into cc-storage.

Parameters:
  • cc_config (str) – cerebralcortex config directory
  • user_id (str) – user id. Currently, data can only be imported for a single user at a time
  • file_path (str) – file path
  • allowed_streamname_pattern (str) – (optional) regex; only stream names matching this pattern will be processed
  • ignore_streamname_pattern (str) – (optional) regex of stream names to be ignored during the ingestion process
  • compression (str) – (optional) compression type, if the CSV file is compressed
  • header (int) – (optional) row number that should be used for column names. None means the file does not contain a header row
  • metadata (Metadata) – (optional) if provided, this metadata will be used for the data file. metadata and metadata_parser are mutually exclusive.
  • metadata_parser (python function) – a parser that can parse JSON files and return a valid Metadata object. metadata_parser and metadata are mutually exclusive.
  • data_parser (python function) – a parser that can parse each line of a data file. Data files are read as lists of lines; data_parser is applied to every row.

Notes

The CSV file should be accompanied by a metadata file. The data file and its metadata file must have the same base name (for example, data.csv and data.json). Metadata files must be JSON files.

Returns:

False in case of an error

Return type:

bool

import_dir(cc_config: dict, input_data_dir: str, user_id: str = None, data_file_extension: list = [], allowed_filename_pattern: str = None, allowed_streamname_pattern: str = None, ignore_streamname_pattern: str = None, batch_size: int = None, compression: str = None, header: int = None, metadata: cerebralcortex.core.metadata_manager.stream.metadata.Metadata = None, metadata_parser: Callable = None, data_parser: Callable = None, gen_report: bool = False)[source]

Scan a data directory, parse its files, and ingest the data into the CerebralCortex backend.

Parameters:
  • cc_config (str) – cerebralcortex config directory
  • input_data_dir (str) – data directory path
  • user_id (str) – user id. Currently, import_dir only supports parsing a directory associated with a single user
  • data_file_extension (list[str]) – (optional) provide file extensions (e.g., .doc) that must be ignored
  • allowed_filename_pattern (str) – (optional) regex of files that must be processed.
  • allowed_streamname_pattern (str) – (optional) regex; only stream names matching this pattern will be processed
  • ignore_streamname_pattern (str) – (optional) regex of stream names to be ignored during the ingestion process
  • batch_size (int) – (optional) setting this parameter turns on Spark parallelism; batch size is the number of files each worker will process
  • compression (str) – (optional) compression type, if the CSV files are compressed
  • header (int) – (optional) row number that should be used for column names. None means the file does not contain a header row
  • metadata (Metadata) – (optional) if provided, the same metadata will be used for all data files. metadata and metadata_parser are mutually exclusive.
  • metadata_parser (python function) – a parser that can parse JSON files and return a valid Metadata object. metadata_parser and metadata are mutually exclusive.
  • data_parser (python function) – a parser that can parse each line of a data file. import_dir reads data files as lists of lines; data_parser is applied to every row.
  • gen_report (bool) – setting this to True produces a console report of the total failures that occurred during the ingestion process.

Notes

Each CSV file should be accompanied by a metadata file. The data file and its metadata file must have the same base name (for example, data.csv and data.json). Metadata files must be JSON files.

Todo

Provide sample metadata file URL
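
Example (illustrative sketch of the data_parser hook documented above). The return shape the importer expects from data_parser is not specified on this page, so the list-of-fields return value below is an assumption; paths and the user id are placeholders.

    from cerebralcortex.data_importer import import_dir


    def my_data_parser(line: str) -> list:
        # Assumption: each row is a comma-separated line of the CSV file.
        return [field.strip() for field in line.split(",")]


    import_dir(
        cc_config="/path/to/cc/config/dir/",   # placeholder config directory
        input_data_dir="/path/to/data/dir/",   # placeholder data directory
        user_id="demo-user-id",                # placeholder user id
        data_parser=my_data_parser,            # applied to every row of every data file
    )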