pulsarpy_to_encodedcc.dcc_submit¶
- Required environment variables:
- Those that are required in the pulsarpy.models module for connecting to Pulsar LIMS: PULSAR_API_URL and PULSAR_TOKEN.
- Those that are required in the encode_utils.connection module to connect to the ENCODE Portal: DCC_API_KEY and DCC_SECRET_KEY.
- Optional environment variables:
- DCC_MODE - Specifies which ENCODE Portal host to connect to. If not set, the host must be provided through the dcc_mode argument when instantiating the Submit class.
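Before instantiating Submit, it can help to confirm that the required variables are present. The following is a minimal sketch, not part of this module: the helper name missing_env_vars is hypothetical, and only the variable names come from the list above.

```python
import os

# Variable names come from the list above; the helper itself is hypothetical
# and is not provided by pulsarpy_to_encodedcc.
REQUIRED_ENV_VARS = (
    "PULSAR_API_URL",
    "PULSAR_TOKEN",
    "DCC_API_KEY",
    "DCC_SECRET_KEY",
)


def missing_env_vars(environ=None):
    """Return the required variable names that are unset or empty."""
    env = os.environ if environ is None else environ
    return [name for name in REQUIRED_ENV_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_env_vars()
    if missing:
        print("Missing environment variables: " + ", ".join(missing))
```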
-
exception
pulsarpy_to_encodedcc.dcc_submit.IpLaneException[source]¶ Bases: Exception
Raised when posting an IP. This is a temporary class that’ll be removed once the exceptions are handled properly in the Submit.post_ip_lane method.
-
exception
pulsarpy_to_encodedcc.dcc_submit.ExpMissingReplicates[source]¶ Bases: Exception
Raised when trying to POST an experiment to the Portal (such as a control experiment) and there aren’t any replicates (Biosample records) to attach to it.
-
exception
pulsarpy_to_encodedcc.dcc_submit.MissingTargetUpstream[source]¶ Bases: Exception
Raised when submitting a record that tries to link to a DCC target, but the target record in Pulsar doesn’t have the upstream_identifier attribute set.
-
exception
pulsarpy_to_encodedcc.dcc_submit.NoFastqFile[source]¶ Bases: Exception
Raised in Submit.post_fastq_file() when submitting either an R1 FASTQ file or an R2 FASTQ file, and the filepath isn’t set in the corresponding SequencingResult record in Pulsar.
-
class
pulsarpy_to_encodedcc.dcc_submit.Submit(dcc_mode=None, extend_arrays=True)[source]¶ Bases: object
Contains methods for submitting various types of objects in Pulsar to the ENCODE Portal.
-
extend_arrays= None¶ When patching, there is the option to extend array properties or overwrite their values. The default is to extend.
-
sanitize_prop_val(txt)[source]¶ Replaces characters that can be problematic in property values on the ENCODE Portal. For example, the ‘/’ character in an alias is a problem since the alias is an identifying property that can be used in a URL to view the record. In this case, the ‘/’ will be interpreted as a path separator.
Characters that get replaced: currently, just ‘/’ with ‘-’.
Parameters: txt – str. The value to clean.
Returns: str. The cleaned value, acceptable for submission.
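The replacement described above can be sketched as a standalone function. This is an illustration of the documented behavior (only ‘/’ is replaced with ‘-’), not the method's actual source.

```python
def sanitize_prop_val(txt):
    """Replace characters that are problematic in Portal property values.

    Standalone sketch of the documented behavior: currently only '/'
    is replaced, with '-'.
    """
    return txt.replace("/", "-")


# An alias containing a path separator becomes safe to use in a URL.
print(sanitize_prop_val("lab/biosample/42"))
```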
-
get_vendor_id_from_encodeportal(pulsar_vendor_id)[source]¶ Given a Pulsar Vendor record ID, returns the upstream identifier.
Raises: UpstreamNotSet – The Pulsar vendor.upstream_identifier attribute isn’t set.
-
patch(payload, upstream_id, dont_extend_arrays=False)[source]¶ A wrapper over encode_utils.connection.Connection.patch().
Parameters: dont_extend_arrays – bool. Dynamic way to signal not to extend array property values. If not True, then the boolean value of self.extend_arrays determines whether arrays are extended.
Returns: The JSON response from the PATCH operation, or an empty dict if the record doesn’t exist on the Portal. See encode_utils.connection.Connection.patch() for more details.
Return type: dict
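The interplay between dont_extend_arrays and the instance-level extend_arrays setting can be sketched as follows. The helper name should_extend_arrays is hypothetical and only mirrors the rule stated in the parameter description; the real method may be organized differently.

```python
def should_extend_arrays(instance_extend_arrays, dont_extend_arrays=False):
    """Decide whether a PATCH should extend array properties.

    Hypothetical helper mirroring the documented rule: an explicit
    dont_extend_arrays=True wins; otherwise the boolean value of the
    instance-level setting (self.extend_arrays) decides.
    """
    if dont_extend_arrays is True:
        return False
    return bool(instance_extend_arrays)
```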
-
post(payload, dcc_profile, pulsar_model, pulsar_rec_id)[source]¶ A wrapper over encode_utils.connection.Connection.post().
First checks if the Pulsar record has an upstream_identifier set, and if set, returns it rather than attempting to re-post.
Adds aliases to the payload, built from the Pulsar record’s ID and name.
Sets the profile key in the payload.
If the record is successfully posted to the prod ENCODE Portal, then sets the upstream_identifier attribute in the Pulsar record.
Parameters: - payload – dict. The new record attributes to submit.
- dcc_profile – str. The name of the ENCODE Profile for this record, e.g. ‘biosample’, ‘genetic_modification’.
- pulsar_model – One of the defined subclasses of the models.Model class, e.g. models.Model.Biosample, which will be used to set the Pulsar record’s upstream_identifier attribute after a successful POST to the ENCODE Portal.
- pulsar_rec_id – str. The identifier of the Pulsar record to POST to the DCC.
Returns: The upstream identifier for the new record on the ENCODE Portal, or the existing upstream identifier if the record already exists; see encode_utils.utils.get_record_id() for more details.
Return type: str
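The sequence of steps that post() is described as performing can be sketched with in-memory stand-ins. Everything here is an assumption made for illustration: the record is a plain dict rather than a pulsarpy model, submit_fn replaces the real encode_utils call, the "_profile" key name and the "prod" mode string are guesses, and real aliases may need additional formatting.

```python
def post_sketch(payload, dcc_profile, record, submit_fn, dcc_mode):
    """Illustrative outline of the documented post() steps.

    record    -- dict standing in for a Pulsar record, with the keys
                 'id', 'name' and 'upstream_identifier' (hypothetical shape).
    submit_fn -- callable standing in for the actual POST; takes the
                 payload and returns the new upstream identifier.
    dcc_mode  -- the Portal being targeted; 'prod' is an assumed value.
    """
    # 1. Return early if the record was already submitted.
    existing = record.get("upstream_identifier")
    if existing:
        return existing

    # 2. Add aliases derived from the record's ID and name.
    payload = dict(payload)
    aliases = list(payload.get("aliases", []))
    for alias in (str(record["id"]), record["name"]):
        if alias and alias not in aliases:
            aliases.append(alias)
    payload["aliases"] = aliases

    # 3. Set the profile key (the key name is an assumption).
    payload["_profile"] = dcc_profile

    # 4. Submit, and record the identifier only for the prod Portal.
    upstream = submit_fn(payload)
    if dcc_mode == "prod":
        record["upstream_identifier"] = upstream
    return upstream
```

Because of the early return in step 1, calling the function a second time for the same record returns the stored identifier without submitting again.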
-
get_biosample_term_name_and_type(biosample)[source]¶ Creates a dict with the keys biosample_term_name, biosample_term_id, and biosample_type.
Parameters: biosample – pulsarpy.models.Biosample instance.
Returns: dict.
-
post_library_through_fastq(pulsar_library_id, dcc_exp_id, patch=False)[source]¶ POSTs the Biosample, its latest Library, and all SequencingResults for that Library.
Parameters: - pulsar_library_id – int. The ID of a Pulsar Library record.
- dcc_exp_id – int. The ID of the experiment record on the Portal to link the replicate to.
-
post_sres(pulsar_sres_id, enc_replicate_id, patch=False)[source]¶ A wrapper over self.post_fastq_file(). Whereas self.post_fastq_file() only uploads the FASTQ file for the given read number, this method calls self.post_fastq_file() once for each FASTQ file in the Pulsar SequencingResult. Thus, if paired-end sequencing was done, self.post_fastq_file() will be called twice to upload the forward and reverse reads FASTQ files.
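The once-per-read behavior can be sketched as a loop over read numbers. This is only an outline under stated assumptions: the SequencingResult is represented by a dict, the keys read1_path and read2_path are hypothetical field names, and post_fastq_file is passed in as a callable rather than called on a Submit instance.

```python
def post_sres_sketch(sres, post_fastq_file):
    """Call post_fastq_file once per FASTQ file present in the result.

    sres            -- dict standing in for a Pulsar SequencingResult; the
                       keys 'read1_path' and 'read2_path' are hypothetical.
    post_fastq_file -- callable taking the read number (1 or 2).
    """
    responses = {}
    for read_num in (1, 2):
        # Skip reads with no file, e.g. read 2 of a single-end run.
        if sres.get("read{}_path".format(read_num)):
            responses[read_num] = post_fastq_file(read_num)
    return responses
```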
-
check_if_biosample_has_exp_on_portal(dcc_biosample_id)[source]¶ Given a Portal biosample record ID, searches the Portal for associated experiment records. Any that are found are returned in a list.
Parameters: dcc_biosample_id – str. A biosample record identifier on the Portal.
Returns: list of associated experiment records, where each is JSON-serialized.
Raises: Exception – The biosample is linked to more than one experiment.
-
post_chipseq_ctl_exp(rec_id, wt_input=False, paired_input=False, exp_only=False, patch=False)[source]¶ Creates a control experiment record on the ENCODE Portal for either the paired-input control biosample(s) or the wild-type input biosample on the Pulsar ChipseqExperiment.
Parameters: - rec_id – int. ID of a ChipseqExperiment record in Pulsar.
- wt_input – bool. True means to make a control experiment on the Portal for the wild-type input biosample on the Pulsar ChipseqExperiment. Exactly one of this parameter and paired_input must be set to True.
- paired_input – bool. True means to make a control experiment on the Portal for the paired-input control biosample(s) on the Pulsar ChipseqExperiment. Exactly one of this parameter and wt_input must be set to True.
- exp_only – bool. Only makes sense to use when the patch parameter is set to True. When exp_only=True, then don’t PATCH Biosample records and everything downstream to the file records on the Portal (don’t call self.post_library_through_fastq()).
Returns: The ENCODE Portal accession of the control experiment.
Return type: str
Raises: ValueError – The parameters wt_input and paired_input are both False or both True. Exactly one of them must be True.
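The exactly-one-of constraint on wt_input and paired_input amounts to an exclusive-or check. A minimal sketch follows; the helper name validate_control_flags is hypothetical, and the real method may word its error differently.

```python
def validate_control_flags(wt_input=False, paired_input=False):
    """Raise ValueError unless exactly one of the two flags is True.

    Hypothetical helper illustrating the documented constraint.
    """
    if bool(wt_input) == bool(paired_input):
        raise ValueError(
            "Exactly one of wt_input and paired_input must be True."
        )
```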
-
post_bulk_atacseq_exp(rec_id, patch=False, patch_all=False)[source]¶ Parameters: - rec_id – int. ID of an AtacSeq experiment record in Pulsar. Should be a bulk and not a single-cell experiment.
- patch – bool. True means to patch the DCC experiment record.
- patch_all – bool. True means to patch not just the experiment record, but its sub-entities also, i.e. biosamples, libraries, replicates, … Setting this to True automatically sets patch to True as well.
Returns:
-
post_chipseq_exp(rec_id, patch=False)[source]¶ Parameters: rec_id – int. ID of a ChipseqExperiment record in Pulsar.
Returns: The ENCODE Portal accession of the control experiment.
Return type: str
Raises: ValueError – Both parameters wt_input and paired_input are set to False or True. Only one of them must be True.
-
post_chipseq_control_experiments(rec_id)[source]¶ POSTS the WT input and the paired input controls that are associated to the indicated ChipseqExperiment in Pulsar, turning each into an experiment record on the Portal.
Parameters: rec_id – int. ID of a ChipseqExperiment record in Pulsar.
-
post_experimental_reps(rec_id, experiment_type, patch=False)[source]¶ POSTS the experimental replicates of a ChipseqExperiment or bulk Atacseq experiment object.
Parameters: - rec_id – int. ID of a ChipseqExperiment record in Pulsar.
- experiment_type – str. Either chip-seq or atac-seq.
-
get_exp_core_payload_props(pulsar_exp_rec, assay_term_name)[source]¶ Parameters: - pulsar_exp_rec – str. pulsarpy.models subclass being either ChipSeq or Atacseq.
- assay_term_name – str. Either ‘ChIP-seq’ or ‘ATAC-seq’.
-
get_gel_lane_with_biosample(immunoblot_id, biosample_id)[source]¶ Given an Immunoblot record ID, and a Biosample record ID, returns the GelLane object with the given Biosample. This method assumes that a Gel won’t have more than one GelLane with the same Biosample.
Note that there should only be 1 Gel, even though the Rails Immunoblot model allows many - on the ‘to fix list’.
Parameters: - immunoblot_id – int. Immunoblot record ID.
- biosample_id – int. Biosample record ID.
Returns: None if the GelLane didn’t pass. Otherwise, a pulsarpy.models.GelLane instance.
Raises: IpLaneException – One of multiple issues that could be present as indicated by the error message, i.e.
- The Biosample doesn’t have an associated Gel
- There isn’t a GelLane with the Biosample on it.
-
post_ip_biosample_characterization(immunoblot_id, biosample_id, patch=False)[source]¶ Submits a Pulsar Immunoblot for a specific lane (biosample) on a Gel to the ENCODE biosample_characterization profile. Such an immunoblot is used to show whether the eGFP-tagged target (using CRISPR) is expressed (has a band in the size range of the expected target size). Only submit these after the ChipSeq experiment (and hence CrisprModification) has been submitted. Even though some Biosamples have a successful IP, they don’t all need to be submitted. For example, in one case a Biosample was lost after a successful IP, so the crosslinking for ChIP couldn’t be done later on. Another reason may be that we already have enough validated Biosamples to submit.
This method makes the assumption that a given gel won’t have more than one lane with the same Biosample.
Returns: - None: The Biosample isn’t already registered on the Portal.
- None: The Biosample has an IP, but not one that passes (based on the GelLane.pass attribute).
- None: The non-WT Biosample isn’t yet registered on the Portal.
- None: The non-WT Biosample doesn’t have a ChipSeq object.
- int: The ID of the created biosample_characterization record on the Portal.
Return type: None or int
-
post_library(rec_id, patch=False)[source]¶ This method will check whether the biosample associated to this library is submitted. If it isn’t, it will first submit the biosample.
-
post_replicate(pulsar_library_id, dcc_exp_id, patch=False)[source]¶ Submits a replicate record, linked to the specified library and experiment. First, replicates on the experiment will be searched to see if a replicate already exists for a specific biosample and library combination, and if so then that replicate’s JSON from the ENCODE Portal is returned.
If the associated experiment is ChIP-seq, and isn’t a control experiment, then the replicate will be submitted with a link to antibody ENCAB728YTO (AB-9 in Pulsar), which is the GFP-specific antibody used to pull down GFP-tagged TFs.
Parameters: - pulsar_library_id – int. The ID of a Library record in Pulsar.
- dcc_exp_id – int. The ID of the experiment record on the Portal to link the replicate to.
Returns: The replicate.uuid property value of the record on the ENCODE Portal.
Return type: str
-
post_fastq_file(pulsar_sres_id, read_num, enc_replicate_id, patch=False)[source]¶ Creates a file record on the ENCODE Portal. Checks the SequencingResult in Pulsar to see where the file is stored. If stored in DNAnexus, the file will be downloaded locally into the directory given by pulsarpy_to_encodedcc.FASTQ_FOLDER (the download folder will be checked first to see if the file was previously downloaded before attempting to download).
After the file object is created on the ENCODE Portal, its accession will be stored as the upstream identifier in the Pulsar SequencingResult record for the given read. Thus, if a file object was created for an R1 FASTQ file, then the SequencingResult.read1_upstream_identifier attribute is updated. If instead a file object was created for an R2 FASTQ file, then the SequencingResult.read2_upstream_identifier attribute is updated.
Some rather complex logic is used to determine the control FASTQ files when submitting an experimental replicate’s FASTQ file. If the Biosample associated with the SequencingResult is part of a ChipseqExperiment, then the control biosamples consist of the paired input(s) and the wild type input, which in Pulsar are given the attribute names ChipseqExperiment.control_replicates and ChipseqExperiment.wild_type_control. A non-control file object on the ENCODE Portal needs to have the controlled_by property set, which points to one or more control FASTQ file accessions on the ENCODE Portal. We normally submit them by matching read numbers, so if the file object we are creating is for an R1 FASTQ file, then all the controlled_by accessions are also R1 FASTQ files. The challenge is in knowing which SequencingResult set to use for control FASTQ files. Since a Biosample can have multiple Libraries, which can have multiple SequencingRequests, which can have multiple SequencingRuns, there can be many sets of SequencingResults. However, since in most cases there will only be one of each, the approach taken here is to use the SequencingResults of the latest SequencingRun of the latest SequencingRequest. Once this simplicity fails to hold, an updated approach will need to be taken.
If you have already created the file record on the Portal and for some reason the FASTQs didn’t upload, you can try to reupload the FASTQs by calling this method with patch equal to False.
Parameters: - pulsar_sres_id – A SequencingResult record in Pulsar.
- read_num – int. Either 1 or 2. Use 1 for the forward reads FASTQ file, and 2 for the reverse reads FASTQ file. A SequencingResult in Pulsar stores the location of both files (if paired-end sequencing).
- enc_replicate_id – str. The identifier of the DCC replicate record that the file record is to be associated with.
Returns: dict. The response from the encode-utils POST or PATCH operation.
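Two pieces of the logic described for post_fastq_file() lend themselves to a short sketch: mapping a read number to the SequencingResult attribute that receives the accession, and choosing the control SequencingResults from the latest SequencingRun of the latest SequencingRequest. The attribute names read1_upstream_identifier and read2_upstream_identifier come from the text above; everything else (the dict layout, the created_at key used to decide what is "latest", and the helper names) is a hypothetical stand-in for the Pulsar models.

```python
def upstream_attr_for_read(read_num):
    """Name of the SequencingResult attribute that stores the accession."""
    if read_num not in (1, 2):
        raise ValueError("read_num must be 1 or 2")
    return "read{}_upstream_identifier".format(read_num)


def latest(records, key="created_at"):
    """Pick the most recent record; 'created_at' is an assumed field."""
    return max(records, key=lambda rec: rec[key])


def control_sequencing_results(library):
    """Return the results of the latest run of the latest request.

    library is a plain dict with a hypothetical nested layout:
    library['sequencing_requests'][i]['sequencing_runs'][j]['sequencing_results']
    """
    request = latest(library["sequencing_requests"])
    run = latest(request["sequencing_runs"])
    return run["sequencing_results"]
```

As the description notes, this selection rule only works while each Biosample typically has a single Library, SequencingRequest, and SequencingRun.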
-
get_chipseq_controlled_by(pulsar_biosample, read_num, dcc_exp_id)[source]¶ Given a p
Returns: The upstream identifiers for the control file objects on the ENCODE Portal.
Return type: list
-
get_barcode_details_for_ssc(ssc_id)[source]¶ The purpose of this method is to provide a value for the library.barcode_details property of the Library profile on the ENCODE Portal. That property takes an array of objects whose properties are ‘barcode’, ‘plate_id’, and ‘plate_location’.
Parameters: ssc_id – The Pulsar ID for a SingleCellSorting record.
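The shape of the barcode_details value described above (an array of objects with the properties ‘barcode’, ‘plate_id’, and ‘plate_location’) can be illustrated as follows. The helper and the sample values are hypothetical; how the real method gathers these values from a SingleCellSorting record is not shown here.

```python
def build_barcode_details(wells):
    """Build a barcode_details array from (barcode, plate_id, plate_location) tuples.

    Hypothetical helper; only the three property names come from the
    description above.
    """
    return [
        {"barcode": barcode, "plate_id": plate_id, "plate_location": location}
        for barcode, plate_id, location in wells
    ]


# Made-up sample values, for illustration only.
details = build_barcode_details([
    ("ACGTACGT", "plate-1", "A1"),
    ("TGCATGCA", "plate-1", "A2"),
])
```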
-