SpikeData

This is the documentation for the SpikeData which loads spiking data from Phy files. This class calculates raw refractory period violations, quality metrics, and loads raw waveforms.

Initializing SpikeData

Similarly to StimulusData, a file_path should be given which holds the various *.npy files that are created post-Phy curation. For Windows this may need to prepended by an “r” due to escaping.

from spikeanalysis import SpikeData
spikes = SpikeData(file_path = "path/to/data")

For Windows I recommend

spikes = SpikeData(file_path = r"path\to\data") # prevents improper escaping

Saving Data

Upon loading a data directory the CACHING state is False for each SpikeData object. This is for the case where the user is memory constrained. In order to save data to be able to reload qc values and waveform values in the future the CACHING should be set to true. This is done with set_caching

spikes.set_caching(cache=True) # files will be saved

If the user decides after this they don’t want to save data or they are just doing exploratory analysis the same method can be used to switching caching to False

spikes.set_caching(cache=False) # files will not be saved

Caching the data is done as either json or .npy depending on whether the data being saved is a dict or a numpy.array, respectively.

Refractory violations

A simple calcuation of refractory period violations for each unit based on the ref_dur_ms given. Since neurons have refractory periods in which they can’t fire units with too many violations are likely poorly separated or mis-curated. Calculated with the function refractory_violation.

spikes.refractory_violation(ref_dur_ms = 2.0)

Generating PCs

Phy has principal components files, but these files do not update after curation, so the function generate_pcs will generate the new PC values based on the manual curation. Once the post-curation PCs have been generated both Isolation Distance (ref) as well as Silhouette Score (ref) can be calculated with generate_qcmetrics.

spikes.generate_pcs() # organize curated data
spikes.generate_qcmetrics() # Isolation distance and Silhouette Score

Creating a quality control threshold

Once refractory period violations and qcmetrics have been calculated, a qc threhold can be generated so that only higher quality units are assessed. Importantly there are no strict cutoffs for these values and they should be determined based on the type of analysis and neuronal populations. The mask is generated with qc_preprocessing. Importantly to overwrite a previous qc mask the recurated can be set to True. An additional layer of masking is provided by the denoise_data, which removes any units labeled as noise in Phy. Finally the mask can be applied to the dataset by setting it with set_qc.

spikes.qc_preprocessing(idthres = 10, rpv = 0.01, sil=0.45)
spikes.set_qc()

Isolation Distance

Relies on the mahalobnis distance between clusters as a metric of clustering quality. Since this metric utilizes the covariance matrix of cluster distances it helps reduce less significant and highly coordinated PC spaces to reduce the curse of dimensionality as well as the fact that many contacts of the probes are in similar locations and so should have correlated PC spaces. The Isolation Distance relies on the mahalobnis distances and is reported as the smallest mahalobnis distance of the nearest spike not found in the current cluster. Proposed by Harris et al (2001) and equation adapted from Schmitzer-Torbert et al (2005). Isolation distances can vary from \(0\) to \(\infty\) with greater distance indicating that clusters are farther apart in PC space and thus likely cleaner clusters.

\[{D^2}_{i,C} = (x_i - \mu_{c})^T \Sigma_c^{-1}(x_i - \mu_C)\]

Silhouette Score

Silhouette Score is another metric of clustering quality that seeks to determine the goodness of clustering. It is a metric that assesses the pointwise distances between every spike within a cluster compared to every other spike in the cluster as well as every other spike in the nearest other cluster. The basic idea is that for each spike we can determine whether it fits better within its assigned cluster \(a(i)\) or whether it would have better distance scores to the neighboring cluster \(b(i)\):

\[ \begin{align}\begin{aligned}a(i) = \frac{1}{|C_K| - 1} \Sigma_{x \in C_K, x \neq i} distance(i, x)\\b(i) = \frac{1}{|C_L|} \Sigma_{x \in C_L} distance(i, x)\end{aligned}\end{align} \]

The silhouette score is than:

\[s(i) = \frac{b(i) - a(i)}{max(a(i), b(i))}\]

For spikenalysis, rather than this implementation proposed by Rousseeuw, the simplified silhouette is used as proposed by Hruschka et al. This makes use of the centroid distance rather than pairwise. So,

\[distance(i, \mu_{C_K})\]

The general interpretation of the silhouette score is that \(-1\) would indicate that a spike was placed in the wrong the cluster whereas a score of \(1\) is that a spike was put into the correct cluster. We can take the average of all these scores to get the average silhouette score to indicate the overall quality of a cluster. \(\frac{1}{n spikes} \Sigma s(i)\) Thus similar to the per-spike basis the per-cluster score can vary from \(-1\) (bad cluster) to \(1\) (great cluster) with intermediate values.

Denoising Data

Phy allows for the labeling of the curated data. spikeanalysis only uses one of these labels: noise. The goal is to remove multiunit and have only “good” units, which in spikeanalysis is done with the pc_metrics and the refractory period violations. But certain types of artifacts (ie optogenetic stimulus) artifacts can actually have great qc metrics since they are so distinct from the “good” units. So in order to remove these high-qc, but artifact-based units, you add a noise label in Phy (see Phy instructions) and then run the helper function denoise_data to remove anything you want to be removed regardless of quality values.

spikes.denoise_data() # remove units labeled as Phy noise

Denoising data alters the values of spike_clusters, raw_spike_times, etc, so if testing multiple iterations of different qc metrics one should use reload_data to “un”-denoise the data. This allows applying new qc masks to the data while iterating through values. For this reason the run_all (see below) does not run denoise_data. But do note that SpikeAnalysis does run denoise_data automatically, so if iterating with SpikeAnalysis one should run reload_data

spikes.reload_data()

Raw waveforms

Although Phy has templates of each unit sometimes it is beneficial to analyze the raw waveforms of a neuron. This can be accomplished reading the raw waveforms with the function get_waveforms. The user can specificy the number of samples around the spike time to load (Phy shows 82 samples so this the default) and the number of waveforms can be specified with n_wfs. The waveforms will be saved as .json if set_caching has been run. Once the raw waveforms have been loaded some common values (depth, amplitude half-width) can be calculated with get_waveform_values. The depth can be optionally specified for depths to be real depth in tissue. If this isn’t given then depths are given as distance from the “0” contact of the probe.

spikes.get_waveforms()
spikes.get_waveform_values(depth=1000)

Waveform depth

Since sorting is performed relative to the channel_map, the raw depth of each channel are given relative to the \(0\). Since this is typically the bottom of the probe the values are at the relative to the bottom of the probe. This can easily be corrected. Determining the depth of a spike, though can be calculated a variety of different ways. Currently to keep the code as understandable and maintainable a weighted average is used (faster, slightly less accurate) rather than a PC based approach (slower, slightly more accurate)

\[\frac{1}{\Sigma amplitudes} \sum{amplitudes * depths}\]

Waveform amplitudes

Since neurons should always have the same amplitude we can assess the variation in amplitudes as a measure of the quality of a neuron. We expect a rather gaussian distribution of amplitudes so using the get_amplitudes() function we assess how many spikes fall within a certain std of the waveform data.

spikes.get_amplitudes(std=2) # 2 std devs

Pipeline Function

For users wanting to use all the functionality of SpikeData an easy to use pipeline will run all functions automatically. (This also means the user doesn’t need to remember a bunch of function names.) This function is called run_all and will request all parameters to be provided. Example below will all values included.

spikes.run_all(
    ref_dur_ms=2, # 2 ms refractory period
    idthres=20, # isolation distance 20--need an empiric number from your data
    rpv=0.02, # 2% the amount of spikes violating the 2ms refractory period allowed
    sil=0.45, # silhouette score (-1,1) with values above 0 indicates better and better clustering
    amp_std=2 # number of std deviations above mean waveform amplitude to look at
    amp_cutoff=0.98, # percent of neurons which must fall within amp_std deviations of mean waveform
    recurated= False, # I haven't recurated my data
    set_caching = True, # I want to save data for future use
    depth= 500, # probe inserted 500 um deep
)

References

See references page.