CerebralCortex Data importer
Subpackages
Submodules
cerebralcortex.data_importer.ingest module
import_dir(cc_config: dict, input_data_dir: str, user_id: str = None, data_file_extension: list = [], allowed_filename_pattern: str = None, allowed_streamname_pattern: str = None, ignore_streamname_pattern: str = None, batch_size: int = None, compression: str = None, header: int = None, metadata: cerebralcortex.core.metadata_manager.stream.metadata.Metadata = None, metadata_parser: Callable = None, data_parser: Callable = None, gen_report: bool = False)
Scan the data directory, parse the files, and ingest the data into the CerebralCortex backend.
Parameters:
- cc_config (str) – CerebralCortex config directory.
- input_data_dir (str) – path of the data directory.
- user_id (str) – user id. Currently import_dir only supports parsing a directory associated with a single user.
- data_file_extension (list[str]) – (optional) file extensions (e.g., .doc) that must be ignored.
- allowed_filename_pattern (str) – (optional) regex matching the files that must be processed.
- allowed_streamname_pattern (str) – (optional) regex matching the stream names to process; only matching streams are processed.
- ignore_streamname_pattern (str) – (optional) regex matching the stream names to ignore during ingestion.
- batch_size (int) – (optional) setting this parameter turns on Spark parallelism. The batch size is the number of files each worker will process.
- compression (str) – (optional) name of the compression format, if the csv files are compressed.
- header (int) – (optional) number of the row to use for column names. None means the file does not contain a header.
- metadata (Metadata) – (optional) if passed, the same metadata is used for all data files. Cannot be combined with metadata_parser.
- metadata_parser (python function) – (optional) a parser that can parse json files and return a valid Metadata object. Cannot be combined with metadata.
- data_parser (python function) – (optional) a parser that can parse each line of a data file. import_dir reads a data file as a list of lines, and data_parser is applied to every line.
- gen_report (bool) – if True, prints to the console the total number of failures that occurred during ingestion.
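As an illustration of the data_parser parameter, here is a minimal sketch of a per-line parser. The function name parse_line, the comma-separated line layout, and the list-of-strings return value are all assumptions made for this example; the return shape that import_dir actually expects is not stated in this reference.

```python
from typing import List, Optional


def parse_line(line: str) -> Optional[List[str]]:
    """Hypothetical data_parser: split one comma-separated line into fields.

    Assumption: each line looks like "timestamp, offset, value".
    Blank lines yield None so the caller can skip them.
    """
    line = line.strip()
    if not line:
        return None
    return [field.strip() for field in line.split(",")]


# A data file read as a list of lines, as described for data_parser.
lines = [
    "1546300800000, -21600000, 72.5\n",
    "\n",
    "1546300801000, -21600000, 73.1\n",
]

# Apply the parser to every line and drop the blank ones.
rows = [parsed for parsed in map(parse_line, lines) if parsed is not None]
print(rows)
```

A function like this would be passed as data_parser=parse_line; whether import_dir tolerates None for skipped lines is not documented here and should be checked against the library.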
Notes
Each csv file should be accompanied by a metadata file. The data file and its metadata file should have the same base name, for example data.csv and data.json. Metadata files should be json files.
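The naming convention can be checked before running an import. The sketch below is not part of the CerebralCortex API: pair_data_with_metadata is a hypothetical helper that only looks at plain .csv files in a single directory (compressed files such as data.csv.gz are not handled) and reports which data files lack a matching .json file.

```python
import tempfile
from pathlib import Path
from typing import List, Tuple


def pair_data_with_metadata(data_dir: str) -> Tuple[List[Tuple[Path, Path]], List[Path]]:
    """Hypothetical helper: match each *.csv file with a same-named *.json file.

    Returns (paired, orphans), where orphans are csv files without metadata.
    """
    paired, orphans = [], []
    for csv_path in sorted(Path(data_dir).glob("*.csv")):
        meta_path = csv_path.with_suffix(".json")  # data.csv -> data.json
        if meta_path.is_file():
            paired.append((csv_path, meta_path))
        else:
            orphans.append(csv_path)
    return paired, orphans


# Demonstration on a throwaway directory with one complete pair and one orphan.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("data.csv", "data.json", "orphan.csv"):
        (Path(tmp) / name).write_text("")
    paired, orphans = pair_data_with_metadata(tmp)
    paired_names = [(d.name, m.name) for d, m in paired]
    orphan_names = [p.name for p in orphans]

print(paired_names)
print(orphan_names)
```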
Todo
Provide sample metadata file URL
import_file(cc_config: dict, user_id: str, file_path: str, allowed_streamname_pattern: str = None, ignore_streamname_pattern: str = None, compression: str = None, header: int = None, metadata: cerebralcortex.core.metadata_manager.stream.metadata.Metadata = None, metadata_parser: Callable = None, data_parser: Callable = None)
Import a single file and its metadata into cc-storage.
Parameters:
- cc_config (str) – CerebralCortex config directory.
- user_id (str) – user id. Currently import_dir only supports parsing a directory associated with a single user.
- file_path (str) – path of the file to import.
- allowed_streamname_pattern (str) – (optional) regex matching the stream names to process; only matching streams are processed.
- ignore_streamname_pattern (str) – (optional) regex matching the stream names to ignore during ingestion.
- compression (str) – (optional) name of the compression format, if the csv file is compressed.
- header (int) – (optional) number of the row to use for column names. None means the file does not contain a header.
- metadata (Metadata) – (optional) metadata to use for the data file. Cannot be combined with metadata_parser.
- metadata_parser (python function) – (optional) a parser that can parse json files and return a valid Metadata object. Cannot be combined with metadata.
- data_parser (python function) – (optional) a parser that can parse each line of a data file. The data file is read as a list of lines, and data_parser is applied to every line.
Notes
Each csv file should be accompanied by a metadata file. The data file and its metadata file should have the same base name, for example data.csv and data.json. Metadata files should be json files.
Returns: False in case of an error
Return type: bool
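Because the only documented return value is False on error, a caller importing several files one by one may want to collect the failures. The following is a sketch under stated assumptions: import_all and fake_import are hypothetical names, and fake_import stands in for a real call so the example runs without a CerebralCortex installation. In practice, import_one would presumably wrap import_file with the config and user id already bound.

```python
from typing import Callable, Iterable, List


def import_all(import_one: Callable[[str], bool], file_paths: Iterable[str]) -> List[str]:
    """Call import_one on each path and return the paths that reported an error.

    Only an explicit False is treated as failure, since the value returned
    on success is not stated in this reference.
    """
    failed = []
    for path in file_paths:
        if import_one(path) is False:
            failed.append(path)
    return failed


def fake_import(path: str) -> bool:
    """Stand-in for a real import call: fails for one specific file name."""
    return not path.endswith("bad.csv")


failed = import_all(fake_import, ["a.csv", "bad.csv", "b.csv"])
print(failed)
```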
cerebralcortex.data_importer.main module
Module contents
import_file and import_dir are also available at the package level (cerebralcortex.data_importer), with the same signatures, parameters, and notes as documented under cerebralcortex.data_importer.ingest above.