<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[AC on AI]]></title><description><![CDATA[Adam Cataldo's thoughts on AI]]></description><link>https://aconai.dev</link><image><url>https://substackcdn.com/image/fetch/$s_!d5KF!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc801fb49-fb9c-481c-b837-f71c9c38bd5c_1024x1024.png</url><title>AC on AI</title><link>https://aconai.dev</link></image><generator>Substack</generator><lastBuildDate>Thu, 16 Apr 2026 20:43:32 GMT</lastBuildDate><atom:link href="https://aconai.dev/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[Adam Cataldo]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[adamcataldo1@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[adamcataldo1@substack.com]]></itunes:email><itunes:name><![CDATA[Adam Cataldo]]></itunes:name></itunes:owner><itunes:author><![CDATA[Adam Cataldo]]></itunes:author><googleplay:owner><![CDATA[adamcataldo1@substack.com]]></googleplay:owner><googleplay:email><![CDATA[adamcataldo1@substack.com]]></googleplay:email><googleplay:author><![CDATA[Adam Cataldo]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[Optimization in deep learning]]></title><description><![CDATA[Part 1: The challenge]]></description><link>https://aconai.dev/p/optimization-in-deep-learning</link><guid isPermaLink="false">https://aconai.dev/p/optimization-in-deep-learning</guid><dc:creator><![CDATA[Adam Cataldo]]></dc:creator><pubDate>Fri, 06 Jun 2025 19:00:41 GMT</pubDate><enclosure 
url="https://substackcdn.com/image/fetch/$s_!h91D!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbde433b8-d387-4869-af23-8fcf78c41ec1_1273x924.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Optimization is a challenging problem in deep learning. Let&#8217;s suppose you have a large training data set that is fairly representative of the data you might see in the wild when running inference. That alone is no small feat<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a>. Now you need to choose a model architecture, train it, and hope it performs well on inference. Choosing a model architecture is itself a non-trivial endeavor, but let&#8217;s say you have a good candidate architecture. Maybe you choose an architecture that&#8217;s done well on similar problems. Maybe you even have a good foundation model<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a> that you&#8217;re fine-tuning. Training is fundamentally an optimization problem: find a set of model parameters that minimizes your training loss.</p><p>The primary goal of training is to pick the best parameters, with secondary goals of finding these parameters fast and with minimal resources. All three of these goals compete with one another. This is a hard problem. Ultimately, you&#8217;re using convex optimization techniques to solve a non-convex optimization problem. 
The loss surface you&#8217;re trying to find a minimum for might look kind of crazy; imagine something like this:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!h91D!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbde433b8-d387-4869-af23-8fcf78c41ec1_1273x924.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!h91D!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbde433b8-d387-4869-af23-8fcf78c41ec1_1273x924.png 424w, https://substackcdn.com/image/fetch/$s_!h91D!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbde433b8-d387-4869-af23-8fcf78c41ec1_1273x924.png 848w, https://substackcdn.com/image/fetch/$s_!h91D!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbde433b8-d387-4869-af23-8fcf78c41ec1_1273x924.png 1272w, https://substackcdn.com/image/fetch/$s_!h91D!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbde433b8-d387-4869-af23-8fcf78c41ec1_1273x924.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!h91D!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbde433b8-d387-4869-af23-8fcf78c41ec1_1273x924.png" width="1273" height="924" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/bde433b8-d387-4869-af23-8fcf78c41ec1_1273x924.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:924,&quot;width&quot;:1273,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:961080,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://aconai.dev/i/164817286?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbde433b8-d387-4869-af23-8fcf78c41ec1_1273x924.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!h91D!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbde433b8-d387-4869-af23-8fcf78c41ec1_1273x924.png 424w, https://substackcdn.com/image/fetch/$s_!h91D!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbde433b8-d387-4869-af23-8fcf78c41ec1_1273x924.png 848w, https://substackcdn.com/image/fetch/$s_!h91D!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbde433b8-d387-4869-af23-8fcf78c41ec1_1273x924.png 1272w, https://substackcdn.com/image/fetch/$s_!h91D!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbde433b8-d387-4869-af23-8fcf78c41ec1_1273x924.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Note that this is just a made-up objective function. Izmailov et al.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-3" href="#footnote-3" target="_self">3</a> created some visualizations of real objective functions in deep neural nets that make it clear this visualization isn&#8217;t particularly far off from objective functions you may encounter in the wild. Keep in mind, though, that this objective function has only two variables. A neural net with millions, or even billions<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-4" href="#footnote-4" target="_self">4</a>, of parameters can have kinds of complexity that are impossible to visualize like this.</p><p>Just looking at this, it&#8217;s clear that the risk of converging to a local minimum is quite real. 
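</p><p>To make the local-minimum risk concrete, here is a minimal sketch in Python. The one-dimensional function f, the learning rate, the step count, and the two starting points are all made up for illustration; none of them come from a real network. Plain gradient descent ends up in a different minimum depending only on where it starts:</p>

```python
def f(x):
    # Hypothetical non-convex "loss" with two minima (illustration only).
    return x**4 - 3 * x**2 + x


def grad_f(x):
    # Analytic derivative of f.
    return 4 * x**3 - 6 * x + 1


def gradient_descent(x0, lr=0.01, steps=2000):
    # Plain gradient descent with a fixed learning rate.
    x = x0
    for _ in range(steps):
        x = x - lr * grad_f(x)
    return x


x_left = gradient_descent(-2.0)   # starts on the left slope
x_right = gradient_descent(2.0)   # starts on the right slope

print(f"start -2.0: x = {x_left:.3f}, f(x) = {f(x_left):.3f}")
print(f"start +2.0: x = {x_right:.3f}, f(x) = {f(x_right):.3f}")
```

<p>With these made-up numbers, the run that starts at 2.0 settles in the shallower minimum on the right, even though a lower one exists on the left.</p><p>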
Converging to a local minimum isn&#8217;t necessarily a huge problem in and of itself. If the loss at a local minimum is close to the loss at the global minimum, for example, the cost of getting it wrong might not be that high on inference, assuming the training data is fairly representative of the test data. That said, even &#8220;good&#8221; local minima can slow down training. Techniques like mini-batch gradient descent<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-5" href="#footnote-5" target="_self">5</a> are used in part to avoid getting stuck in local minima, but when the algorithm gets near a local minimum, it may still hang out around there for a while before jumping out, since the gradient is close to zero there.</p><p>Saddle points are another problem. Like local minima, they can slow down optimizers, because the gradient is close to zero near them. In the high-dimensional loss functions common in neural networks, saddle points are much more common than local minima<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-6" href="#footnote-6" target="_self">6</a>.</p><p>Yet another problem is &#8220;sharp&#8221; minima. Informally, a sharp minimum is one where a small change in parameters yields a large change in the loss function near the minimum. More formally, a sharp minimum is one where &#8711;<sup>2</sup>f(x<sub>min</sub>), the Hessian (second-derivative matrix) of the loss function at the minimum, has large positive eigenvalues<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-7" href="#footnote-7" target="_self">7</a>. Because of this sensitivity, and because training data will never perfectly represent inference data, sharp minima tend to degrade generalization quality.</p><p>In the next few posts, I&#8217;m going to explore techniques that can be used to mitigate some of these issues. As you&#8217;ll see, much of the standard deep learning toolkit is really focused on this problem. 
I&#8217;ll look at the role of batch size in mini-batch gradient descent, advanced optimizer techniques, normalization techniques, and other optimization innovations.</p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>Northcutt, Curtis G., Anish Athalye, and Jonas Mueller. "Pervasive Label Errors in Test Sets Destabilize Machine Learning Benchmarks." <em>arXiv</em>, 26 Mar. 2021, <a href="https://arxiv.org/abs/2103.14749">https://arxiv.org/abs/2103.14749</a>.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>Bommasani, Rishi, et al. "On the Opportunities and Risks of Foundation Models." <em>arXiv</em>, 16 Aug. 2021, <a href="https://arxiv.org/abs/2108.07258">https://arxiv.org/abs/2108.07258</a>.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-3" href="#footnote-anchor-3" class="footnote-number" contenteditable="false" target="_self">3</a><div class="footnote-content"><p>Izmailov, Pavel, et al. "Averaging Weights Leads to Wider Optima and Better Generalization." <em>arXiv</em>, 28 Dec. 
2017, <a href="https://arxiv.org/abs/1712.09913">https://arxiv.org/abs/1712.09913</a>.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-4" href="#footnote-anchor-4" class="footnote-number" contenteditable="false" target="_self">4</a><div class="footnote-content"><p>Roser, Max, et al. "Exponential Growth of Parameters in Notable AI Systems." <em>Our World in Data</em>, <a href="https://ourworldindata.org/grapher/exponential-growth-of-parameters-in-notable-ai-systems">https://ourworldindata.org/grapher/exponential-growth-of-parameters-in-notable-ai-systems</a>. Accessed 30 May 2025.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-5" href="#footnote-anchor-5" class="footnote-number" contenteditable="false" target="_self">5</a><div class="footnote-content"><p>Y. Lecun, L. Bottou, Y. Bengio and P. Haffner, "Gradient-based learning applied to document recognition," in <em>Proceedings of the IEEE</em>, vol. 86, no. 11, pp. 2278-2324, Nov. 1998, doi: 10.1109/5.726791.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-6" href="#footnote-anchor-6" class="footnote-number" contenteditable="false" target="_self">6</a><div class="footnote-content"><p>Yann N. Dauphin, Razvan Pascanu, Caglar Gulcehre, Kyunghyun Cho, Surya Ganguli, and Yoshua Bengio. 2014. Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. In Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 2 (NIPS'14), Vol. 2. MIT Press, Cambridge, MA, USA, 2933&#8211;2941.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-7" href="#footnote-anchor-7" class="footnote-number" contenteditable="false" target="_self">7</a><div class="footnote-content"><p>Keskar, Nitish Shirish, et al. "On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima." 
<em>International Conference on Learning Representations (ICLR)</em>, 2017, <a href="https://arxiv.org/abs/1609.04836">https://arxiv.org/abs/1609.04836</a>.</p></div></div>]]></content:encoded></item><item><title><![CDATA[Entropy explained]]></title><description><![CDATA[Disorderly conduct]]></description><link>https://aconai.dev/p/entropy-explained</link><guid isPermaLink="false">https://aconai.dev/p/entropy-explained</guid><dc:creator><![CDATA[Adam Cataldo]]></dc:creator><pubDate>Fri, 30 May 2025 03:04:44 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!9ayq!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c1b2b17-40e1-4ebf-a8cf-43111f06ec14_1702x1361.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Entropy is a concept that comes up repeatedly in AI. After recently watching <a href="https://youtu.be/qj7HH0PCqIE?si=mnnyLPI-as1KUl1g">a great video</a> on the history of information theory, I thought it would be cool to make a short explainer on entropy, both generally, and specific to AI/ML.</p><p>The idea of entropy was originally created in physics, defined as a measure of disorder and irreversibility in thermodynamic systems. Intuitively, entropy quantifies how dispersed or random the energy in a system is. The Second Law of Thermodynamics states that in an isolated system, one with no energy exchange with the environment, entropy tends to increase or remain constant. In other words, natural processes lead to greater disorder over time<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a>. 
A classic example is heat flowing from a hot object to a cold one: the energy spreads out and becomes less available to do work, increasing the overall entropy of the combined system.</p><p>In statistical mechanics, if a system can be in <em>microstate</em> i with probability p<sub>i</sub>, then the <em>Gibbs entropy</em> is:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;S = -k_B \\sum_i p_i \\ln p_i ,&quot;,&quot;id&quot;:&quot;OISGRHNXUT&quot;}" data-component-name="LatexBlockToDOM"></div><p>where k<sub>B</sub> is the Boltzmann constant<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a>. Here a microstate is the specific arrangement of the particles in the system: their positions and momenta. This Gibbs entropy formula is deeply connected to the concept of <em>information entropy</em>: up to the constant k<sub>B</sub> and the base of the logarithm, the two formulas are the same.</p><h1>Entropy in information theory</h1><p>In 1948, Claude Shannon introduced entropy in the context of information theory to quantify uncertainty in a data source<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-3" href="#footnote-3" target="_self">3</a>. <em>Information entropy</em> measures the average information content or &#8220;surprise&#8221; of a random variable. Consider a discrete random variable X with possible outcomes {x<sub>0</sub>, x<sub>1</sub>, &#8230;} having corresponding probabilities {p<sub>0</sub>, p<sub>1</sub>, &#8230;}. 
The entropy H is defined as:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;H(p) = - \\sum_i p_i \\log_2 p_i&quot;,&quot;id&quot;:&quot;HPCPEAJZVB&quot;}" data-component-name="LatexBlockToDOM"></div><p>To get some intuition around this formula, consider the entropy of a coin toss, where the probability of flipping heads varies from 0 to 1:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!9ayq!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c1b2b17-40e1-4ebf-a8cf-43111f06ec14_1702x1361.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!9ayq!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c1b2b17-40e1-4ebf-a8cf-43111f06ec14_1702x1361.png 424w, https://substackcdn.com/image/fetch/$s_!9ayq!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c1b2b17-40e1-4ebf-a8cf-43111f06ec14_1702x1361.png 848w, https://substackcdn.com/image/fetch/$s_!9ayq!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c1b2b17-40e1-4ebf-a8cf-43111f06ec14_1702x1361.png 1272w, https://substackcdn.com/image/fetch/$s_!9ayq!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c1b2b17-40e1-4ebf-a8cf-43111f06ec14_1702x1361.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!9ayq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c1b2b17-40e1-4ebf-a8cf-43111f06ec14_1702x1361.png" width="1456" height="1164" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8c1b2b17-40e1-4ebf-a8cf-43111f06ec14_1702x1361.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1164,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:111087,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://aconai.dev/i/164753346?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c1b2b17-40e1-4ebf-a8cf-43111f06ec14_1702x1361.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!9ayq!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c1b2b17-40e1-4ebf-a8cf-43111f06ec14_1702x1361.png 424w, https://substackcdn.com/image/fetch/$s_!9ayq!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c1b2b17-40e1-4ebf-a8cf-43111f06ec14_1702x1361.png 848w, https://substackcdn.com/image/fetch/$s_!9ayq!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c1b2b17-40e1-4ebf-a8cf-43111f06ec14_1702x1361.png 1272w, https://substackcdn.com/image/fetch/$s_!9ayq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c1b2b17-40e1-4ebf-a8cf-43111f06ec14_1702x1361.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Note that the entropy is greatest when heads and tails are equally likely. The entropy drops to zero when there is no randomness in the coin toss. Intuitively, there are more &#8220;surprises&#8221; in flipping a fair coin than in flipping an unfair one.</p><p>Entropy provides a lower bound for the average number of bits per event required to transmit or store a sequence of events drawn from a probability distribution. Also, you can find an encoding that gets arbitrarily close to this lower bound. Consider the case where P(heads) = 2/3 and P(tails) = 1/3. The entropy in this case is roughly 0.918 bits. 
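</p><p>As a quick check on these numbers, here is a small Python sketch of the entropy formula applied to a coin. The function name and the list of probabilities to try are my own choices for illustration:</p>

```python
import math


def entropy(probs):
    # Shannon entropy in bits; zero-probability outcomes contribute nothing.
    return sum(p * math.log2(1 / p) for p in probs if p != 0)


for p_heads in (0.0, 0.1, 0.5, 2 / 3, 1.0):
    h = entropy([p_heads, 1 - p_heads])
    print(f"P(heads) = {p_heads:.3f}, entropy = {h:.3f} bits")
```

<p>The fair coin gives exactly 1 bit, the two deterministic coins give 0 bits, and the 2/3 coin lands at about 0.918 bits.</p><p>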
If you encode pairs of heads/tails subsequences using this encoding:</p><ul><li><p>heads, heads =&gt; 0</p></li><li><p>heads, tails =&gt; 10</p></li><li><p>tails, heads =&gt; 110</p></li><li><p>tails, tails =&gt; 111</p></li></ul><p>then the average number of bits required to encode two symbols is 1.889, and the average number of bits per symbol is 0.944. You can reduce this number closer to the lower bound by increasing the size of the subsequence used in the encoding.</p><p>A related idea to entropy is <em>cross entropy</em><a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-4" href="#footnote-4" target="_self">4</a>. Let&#8217;s say the true probability distribution is p, but your estimate of the distribution is q. Then the cross entropy (also written with H) is defined as:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;H(p, q) = -\\sum_i p_i \\log_2 q_i&quot;,&quot;id&quot;:&quot;KYJUPOKGUI&quot;}" data-component-name="LatexBlockToDOM"></div><p>In information theory, cross entropy is the average number of bits per event required to transmit or store a sequence of events drawn from a probability distribution p when your encoding is optimized for the estimated distribution q. H(p, q) will always be greater than or equal to H(p), so there&#8217;s a cost to bad estimation. 
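</p><p>The numbers in this section can be checked with a few lines of Python. This is only a sketch: the helper names and the uniform estimate q = (1/2, 1/2) are assumptions chosen for illustration, while the coin probabilities and the pair code are the ones from the example above:</p>

```python
import math
from itertools import product

p = {"H": 2 / 3, "T": 1 / 3}          # true distribution from the example
q = {"H": 1 / 2, "T": 1 / 2}          # hypothetical (wrong) estimate
pair_code = {"HH": "0", "HT": "10", "TH": "110", "TT": "111"}


def entropy(p):
    # H(p) in bits.
    return sum(p[s] * math.log2(1 / p[s]) for s in p)


def cross_entropy(p, q):
    # H(p, q) in bits: events follow p, code lengths follow q.
    return sum(p[s] * math.log2(1 / q[s]) for s in p)


# Expected code length for two independent tosses under the pair code.
bits_per_pair = sum(
    p[a] * p[b] * len(pair_code[a + b]) for a, b in product(p, repeat=2)
)

print(f"entropy H(p):          {entropy(p):.3f} bits per toss")
print(f"pair code:             {bits_per_pair / 2:.3f} bits per toss")
print(f"cross entropy H(p, q): {cross_entropy(p, q):.3f} bits per toss")
```

<p>The pair code sits between the entropy bound and the 1 bit per toss you would pay for treating the coin as fair. The gap between the last and first numbers, about 0.082 bits, is the price of the bad estimate.</p><p>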
The difference between H(p, q) and H(p) is called the <em>Kullback&#8211;Leibler (KL) divergence</em>:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\begin{eqnarray}\nD_{KL}(p || q) &amp;= H(p, q) - H(p) \\\\\n&amp;= \\sum_i p_i \\log_2 \\frac{p_i}{q_i}\n\\end{eqnarray}&quot;,&quot;id&quot;:&quot;YLUHIAJPGN&quot;}" data-component-name="LatexBlockToDOM"></div><h1>Entropy in machine learning</h1><p>In machine learning, cross entropy is commonly used as a loss function for classification problems<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-5" href="#footnote-5" target="_self">5</a>. In a classification problem with c classes, each label y is a vector of length c, where y<sub>i</sub> is 1 when the example belongs to class i, and all other entries are 0. For example, a label might be y = (0, 0, 1, 0). The model&#8217;s estimate &#375; might be something like (0.1, 0.05, 0.8, 0.05). The cross-entropy loss function is:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;l(y, \\hat{y}) = -\\sum_{i} y_i \\log_2 \\hat{y}_i&quot;,&quot;id&quot;:&quot;TNWFGHAUXO&quot;}" data-component-name="LatexBlockToDOM"></div><p>This is the same as the cross-entropy function above with (p, q) replaced by (y, &#375;). 
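</p><p>Here is a minimal Python sketch of this loss, evaluated on the label and estimate from the example above. The function name is my own, and the code keeps the base-2 logarithm used in the formula:</p>

```python
import math


def cross_entropy_loss(y, y_hat):
    # Base-2 cross-entropy loss; terms with y_i == 0 drop out of the sum.
    return sum(
        y_i * math.log2(1 / p_i) for y_i, p_i in zip(y, y_hat) if y_i != 0
    )


y = (0, 0, 1, 0)                      # one-hot label from the example
y_hat = (0.1, 0.05, 0.8, 0.05)        # predicted probabilities from the example

print(f"loss for the example:     {cross_entropy_loss(y, y_hat):.3f}")
print(f"loss for a perfect guess: {cross_entropy_loss(y, y):.3f}")
```

<p>Only the probability assigned to the true class matters here, so the loss for the example is just the negative base-2 log of 0.8. Deep learning libraries typically use the natural logarithm instead, which only rescales the loss by a constant factor.</p><p>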
Just like in the information theory case, this function is minimized when y = &#375;.</p><p>To see an example of what this loss function looks like, let y = (1, 0), and &#375; = (a, 1-a), for 0 &lt; a &#8804; 1:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!5tfz!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbebf33af-04f2-473d-8b4d-ac532f202243_1662x1361.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!5tfz!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbebf33af-04f2-473d-8b4d-ac532f202243_1662x1361.png 424w, https://substackcdn.com/image/fetch/$s_!5tfz!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbebf33af-04f2-473d-8b4d-ac532f202243_1662x1361.png 848w, https://substackcdn.com/image/fetch/$s_!5tfz!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbebf33af-04f2-473d-8b4d-ac532f202243_1662x1361.png 1272w, https://substackcdn.com/image/fetch/$s_!5tfz!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbebf33af-04f2-473d-8b4d-ac532f202243_1662x1361.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!5tfz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbebf33af-04f2-473d-8b4d-ac532f202243_1662x1361.png" width="1456" height="1192" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/bebf33af-04f2-473d-8b4d-ac532f202243_1662x1361.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1192,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:85555,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://aconai.dev/i/164753346?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbebf33af-04f2-473d-8b4d-ac532f202243_1662x1361.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!5tfz!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbebf33af-04f2-473d-8b4d-ac532f202243_1662x1361.png 424w, https://substackcdn.com/image/fetch/$s_!5tfz!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbebf33af-04f2-473d-8b4d-ac532f202243_1662x1361.png 848w, https://substackcdn.com/image/fetch/$s_!5tfz!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbebf33af-04f2-473d-8b4d-ac532f202243_1662x1361.png 1272w, https://substackcdn.com/image/fetch/$s_!5tfz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbebf33af-04f2-473d-8b4d-ac532f202243_1662x1361.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>The loss equals zero when a = 1, and it grows without bound as a approaches 0.</p><p>There are other uses of entropy in machine learning, in addition to cross-entropy loss. KL divergence often appears in scenarios where you explicitly want to push one probability distribution toward another. For example, in variational inference and variational autoencoders (VAEs), the training objective includes a KL divergence term to make an approximate posterior close to a prior<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-6" href="#footnote-6" target="_self">6</a>. 
In reinforcement learning, algorithms like Trust Region Policy Optimization<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-7" href="#footnote-7" target="_self">7</a> and Proximal Policy Optimization<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-8" href="#footnote-8" target="_self">8</a> constrain or penalize the KL divergence between the new and old policy to ensure gradual updates.</p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>Urone, Paul Peter, and Roger Hinrichs. "12.3 Second Law of Thermodynamics: Entropy." <em>Physics</em>, OpenStax, 26 Mar. 2020, <a href="https://openstax.org/books/physics/pages/12-3-second-law-of-thermodynamics-entropy">https://openstax.org/books/physics/pages/12-3-second-law-of-thermodynamics-entropy</a>.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>"Entropy (Statistical Thermodynamics)." <em>Wikipedia</em>, Wikimedia Foundation, <a href="https://en.wikipedia.org/wiki/Entropy_(statistical_thermodynamics)">https://en.wikipedia.org/wiki/Entropy_(statistical_thermodynamics)</a>. Accessed 29 May 2025.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-3" href="#footnote-anchor-3" class="footnote-number" contenteditable="false" target="_self">3</a><div class="footnote-content"><p>Shannon, Claude E. "A Mathematical Theory of Communication." <em>Bell System Technical Journal</em>, vol. 27, no. 3, 1948, pp. 379&#8211;423, and no. 4, pp. 
623&#8211;656.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-4" href="#footnote-anchor-4" class="footnote-number" contenteditable="false" target="_self">4</a><div class="footnote-content"><p>"Cross-Entropy." <em>Wikipedia: The Free Encyclopedia</em>, Wikimedia Foundation, <a href="https://en.wikipedia.org/wiki/Cross-entropy">https://en.wikipedia.org/wiki/Cross-entropy</a>. Accessed 29 May 2025.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-5" href="#footnote-anchor-5" class="footnote-number" contenteditable="false" target="_self">5</a><div class="footnote-content"><p>Zhang, Aston, et al. &#8220;4.1 Softmax Regression.&#8221; <em>Dive into Deep Learning</em>, <a href="https://d2l.ai/chapter_linear-classification/softmax-regression.html">https://d2l.ai/chapter_linear-classification/softmax-regression.html</a>. Accessed 29 May 2025.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-6" href="#footnote-anchor-6" class="footnote-number" contenteditable="false" target="_self">6</a><div class="footnote-content"><p>Kingma, Diederik P., and Max Welling. "Auto-Encoding Variational Bayes." <em>arXiv</em>, 20 Dec. 2013, <a href="https://arxiv.org/abs/1312.6114">https://arxiv.org/abs/1312.6114</a>.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-7" href="#footnote-anchor-7" class="footnote-number" contenteditable="false" target="_self">7</a><div class="footnote-content"><p>Schulman, John, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. "Trust Region Policy Optimization." <em>Proceedings of the 32nd International Conference on Machine Learning</em>, vol. 37, 2015, pp. 1889&#8211;1897. 
Proceedings of Machine Learning Research, <a href="https://proceedings.mlr.press/v37/schulman15.html">https://proceedings.mlr.press/v37/schulman15.html</a>.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-8" href="#footnote-anchor-8" class="footnote-number" contenteditable="false" target="_self">8</a><div class="footnote-content"><p>Schulman, John, et al. "Proximal Policy Optimization Algorithms." <em>arXiv</em>, 28 Aug. 2017, <a href="https://arxiv.org/abs/1707.06347">https://arxiv.org/abs/1707.06347</a>.</p></div></div>]]></content:encoded></item><item><title><![CDATA[Finding Eigenvalues]]></title><description><![CDATA[How these little guys actually get computed]]></description><link>https://aconai.dev/p/finding-eigenvalues</link><guid isPermaLink="false">https://aconai.dev/p/finding-eigenvalues</guid><dc:creator><![CDATA[Adam Cataldo]]></dc:creator><pubDate>Tue, 27 May 2025 07:05:42 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!3ILF!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf0056cc-a9aa-451d-b077-a9bf8f111b08_912x578.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In AI, eigenvectors and eigenvalues are everywhere. You see them in places like principal component analysis<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a>, RNN stability analysis<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a>, and optimization analysis<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-3" href="#footnote-3" target="_self">3</a>. 
For a square matrix A, an <em>eigenvector</em> is a non-zero vector v, along with a corresponding scalar &#955;, the <em>eigenvalue</em>, such that:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;Av = \\lambda v&quot;,&quot;id&quot;:&quot;CDNQGWPMYV&quot;}" data-component-name="LatexBlockToDOM"></div><p>The key idea is that applying the linear transformation A to the eigenvector v only scales it by &#955; (flipping it when &#955; is negative), and otherwise leaves its direction unchanged. I recently dug into how eigenvectors and eigenvalues get computed in practice, and I wanted to share it, because it was fairly unintuitive, and actually kind of cool.</p><p>First, it&#8217;s worth sharing how I learned to compute eigenvalues back in school, since you might have seen this in a linear algebra class somewhere. Note that:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\begin{align}\nAv &amp;= \\lambda v \\\\\nAv - \\lambda v &amp;= 0 \\\\\nAv - \\lambda I v &amp;= 0 \\\\\n(A - \\lambda I) v &amp;= 0 \n\\end{align}&quot;,&quot;id&quot;:&quot;TSFUOOCGUY&quot;}" data-component-name="LatexBlockToDOM"></div><p>Since v is non-zero, the matrix A &#8722; &#955;I must be singular, which implies that</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\textrm{det}(A - \\lambda I)=0&quot;,&quot;id&quot;:&quot;XYWNCKBBRV&quot;}" data-component-name="LatexBlockToDOM"></div><p>For an n&#215;n matrix A, this determinant expands into an n<sup>th</sup>-degree polynomial in &#955;, the characteristic polynomial. The roots of this polynomial are the eigenvalues of A. 
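</p><p>To make this concrete, here&#8217;s a minimal sketch for the 2&#215;2 case, in pure Python. For a 2&#215;2 matrix, det(A &#8722; &#955;I) works out to &#955;<sup>2</sup> &#8722; tr(A)&#955; + det(A), so the quadratic formula gives both eigenvalues. The helper name <code>eigenvalues_2x2</code> is hypothetical, made up for illustration.</p>

```python
import cmath


def eigenvalues_2x2(a):
    """Eigenvalues of a 2x2 matrix from its characteristic polynomial."""
    (p, q), (r, s) = a
    trace = p + s
    det = p * s - q * r
    # det(A - lambda*I) = lambda^2 - trace*lambda + det
    disc = cmath.sqrt(trace * trace - 4 * det)  # complex sqrt handles complex roots
    return (trace + disc) / 2, (trace - disc) / 2


print(eigenvalues_2x2([[2, 1], [1, 2]]))
```

<p>For [[2, 1], [1, 2]] the polynomial is &#955;<sup>2</sup> &#8722; 4&#955; + 3, which gives the eigenvalues 3 and 1.</p><p>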
For a given eigenvalue &#955;<sub>i</sub> of A, you can find the corresponding eigenvector by plugging it into the equation and solving for v<sub>i</sub>:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;(A-\\lambda_i I)v_i = 0&quot;,&quot;id&quot;:&quot;QLWBDDGHAM&quot;}" data-component-name="LatexBlockToDOM"></div><p>That&#8217;s all great, but it doesn&#8217;t tell you how to actually compute the roots of the polynomial. I did some digging and discovered that standard polynomial root finders like MATLAB&#8217;s <code>roots</code><a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-4" href="#footnote-4" target="_self">4</a> find the roots by finding eigenvalues of the polynomial&#8217;s <em>companion matrix</em><a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-5" href="#footnote-5" target="_self">5</a>, so this seems circular:</p><ol><li><p>To find the eigenvalues of a matrix, just find the roots of a corresponding polynomial.</p></li><li><p>To find the roots of a polynomial, just find the eigenvalues of a corresponding matrix.</p></li></ol><p>I decided to go down the rabbit hole and get to the bottom of this. 
Here&#8217;s a quick summary of how modern libraries actually find eigenvalues.</p><h1>Triangular Matrices</h1><p>Note that if you have an upper-triangular matrix, like:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\begin{bmatrix}\na_{11} &amp; a_{12} &amp; \\cdots &amp; a_{1n} \\\\\n0      &amp; a_{22} &amp; \\cdots &amp; a_{2n} \\\\\n\\vdots &amp; \\vdots &amp; \\ddots &amp; \\vdots \\\\\n0      &amp; 0      &amp; \\cdots &amp; a_{nn}\n\\end{bmatrix},&quot;,&quot;id&quot;:&quot;UHTPXMPUCT&quot;}" data-component-name="LatexBlockToDOM"></div><p>then the eigenvalues are just the diagonal entries of the matrix, since:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\textrm{det}(A - \\lambda I)= (a_{11} - \\lambda)(a_{22} - \\lambda)\\cdots(a_{nn} -\\lambda)&quot;,&quot;id&quot;:&quot;ACFNHDHKZT&quot;}" data-component-name="LatexBlockToDOM"></div><p>This is key, because the general algorithm for finding eigenvalues transforms the matrix into a triangular matrix with the same eigenvalues. Once you&#8217;ve done this, you&#8217;re done, since you can just read the diagonal entries as the eigenvalues.</p><h1>Matrix Similarity</h1><p>Two matrices A and B are <em>similar</em><a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-6" href="#footnote-6" target="_self">6</a> if there exists an invertible matrix P, such that</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;B = P^{-1}AP&quot;,&quot;id&quot;:&quot;AMCRHIZFMO&quot;}" data-component-name="LatexBlockToDOM"></div><p>A key property of similar matrices is that they have the same eigenvalues. 
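</p><p>To see this property in action, here&#8217;s a rough sketch with 2&#215;2 matrices in pure Python. It builds B = P<sup>-1</sup>AP for an arbitrarily chosen invertible P, then compares the trace and determinant of A and B. For a 2&#215;2 matrix those two numbers fix det(A &#8722; &#955;I), so matching values mean matching eigenvalues. The helper names and the example matrices are illustrative assumptions.</p>

```python
def matmul(x, y):
    """Multiply two square matrices given as nested lists."""
    n = len(x)
    return [[sum(x[i][k] * y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def inverse_2x2(m):
    """Inverse of a 2x2 matrix; assumes the determinant is non-zero."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


def trace_and_det(m):
    """The two coefficients of the 2x2 characteristic polynomial."""
    (a, b), (c, d) = m
    return a + d, a * d - b * c


A = [[2.0, 1.0], [1.0, 2.0]]
P = [[1.0, 2.0], [3.0, 4.0]]  # any invertible matrix would do here
B = matmul(matmul(inverse_2x2(P), A), P)  # B = P^-1 A P

print(B)
print(trace_and_det(A), trace_and_det(B))
```

<p>B looks nothing like A entry by entry, but its trace and determinant agree with those of A up to rounding, so the two matrices share the eigenvalues 3 and 1.</p><p>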
So the basic idea of finding eigenvalues for A is to find a matrix similar to A that&#8217;s also upper triangular.</p><h1>Schur Decomposition</h1><p>For any real square matrix A with real eigenvalues, there exists an <em>orthogonal matrix</em> Q<sub>*</sub> and an upper triangular matrix H such that</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;A = Q_*^{-1}HQ_*&quot;,&quot;id&quot;:&quot;ZUQESVSHFO&quot;}" data-component-name="LatexBlockToDOM"></div><p>(When A has complex eigenvalues, the same statement holds with a unitary Q<sub>*</sub> and a complex H.) Q<sub>*</sub> being orthogonal is just a fancy way of saying:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;Q_*^TQ_* = I&quot;,&quot;id&quot;:&quot;PWEVTSSUHZ&quot;}" data-component-name="LatexBlockToDOM"></div><p>Since A = Q<sub>*</sub><sup>-1</sup>HQ<sub>*</sub> makes A and H similar, and H is upper triangular, once you have the Schur decomposition, the eigenvalues of A are just the diagonal entries of H.</p><p>The tricky part is computing the Schur decomposition. To get that, modern eigenvalue solvers use a simpler matrix decomposition: the QR decomposition<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-7" href="#footnote-7" target="_self">7</a>. The QR decomposition is just a breakdown of A into an orthogonal matrix Q and an upper triangular matrix R:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;A = QR&quot;,&quot;id&quot;:&quot;VFPMPJGLDH&quot;}" data-component-name="LatexBlockToDOM"></div><p>QR decomposition is straightforward. 
A simple implementation using the <em>Gram-Schmidt</em><a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-8" href="#footnote-8" target="_self">8</a><em> process</em> is shown <a href="https://github.com/adamcataldo/aconai/blob/0.4/aconai/mathy/qr.py">here</a>:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://github.com/adamcataldo/aconai/blob/0.4/aconai/mathy/qr.py" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!3ILF!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf0056cc-a9aa-451d-b077-a9bf8f111b08_912x578.png 424w, https://substackcdn.com/image/fetch/$s_!3ILF!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf0056cc-a9aa-451d-b077-a9bf8f111b08_912x578.png 848w, https://substackcdn.com/image/fetch/$s_!3ILF!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf0056cc-a9aa-451d-b077-a9bf8f111b08_912x578.png 1272w, https://substackcdn.com/image/fetch/$s_!3ILF!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf0056cc-a9aa-451d-b077-a9bf8f111b08_912x578.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!3ILF!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf0056cc-a9aa-451d-b077-a9bf8f111b08_912x578.png" width="912" height="578" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/df0056cc-a9aa-451d-b077-a9bf8f111b08_912x578.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:578,&quot;width&quot;:912,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:109531,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:&quot;https://github.com/adamcataldo/aconai/blob/0.4/aconai/mathy/qr.py&quot;,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://aconai.dev/i/164513458?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf0056cc-a9aa-451d-b077-a9bf8f111b08_912x578.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!3ILF!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf0056cc-a9aa-451d-b077-a9bf8f111b08_912x578.png 424w, https://substackcdn.com/image/fetch/$s_!3ILF!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf0056cc-a9aa-451d-b077-a9bf8f111b08_912x578.png 848w, https://substackcdn.com/image/fetch/$s_!3ILF!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf0056cc-a9aa-451d-b077-a9bf8f111b08_912x578.png 1272w, https://substackcdn.com/image/fetch/$s_!3ILF!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf0056cc-a9aa-451d-b077-a9bf8f111b08_912x578.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft 
icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>A more popular implementation uses <em>Householder reflections</em><a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-9" href="#footnote-9" target="_self">9</a>. Householder reflections are a bit less straightforward than Gram-Schmidt, but they are the de facto implementation, because they have better numerical stability.</p><p>There&#8217;s a neat property of QR decomposition that lets you use it to find the Schur decomposition. 
First, note that for A = QR with an orthogonal Q:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\begin{align}\nQ^TAQ &amp;= Q^T(QR)Q \\\\\n&amp;= (Q^TQ)RQ \\\\\n&amp;= RQ \\\\\n\\end{align}&quot;,&quot;id&quot;:&quot;SJKSMNMYBY&quot;}" data-component-name="LatexBlockToDOM"></div><p>Since Q<sup>T</sup> = Q<sup>-1</sup>, the matrix Q<sup>T</sup>AQ = RQ is similar to A. If you set:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\begin{align}\nA_0 &amp;= A \\\\\nQ_i, R_i &amp;= \\textrm{qr-decomp}(A_i) \\\\\nA_{i+1} &amp;= R_iQ_i \\\\\n\\end{align}&quot;,&quot;id&quot;:&quot;QAEAMYIRWX&quot;}" data-component-name="LatexBlockToDOM"></div><p>you then get a sequence (A<sub>0</sub>, A<sub>1</sub>, &#8230;) of matrices that are all similar to A. Under suitable conditions (for example, when the eigenvalues have distinct magnitudes), this sequence has the nifty property<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-10" href="#footnote-10" target="_self">10</a> that</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\lim_{k \\to \\infty} A_k = Q_*AQ_*^T=H&quot;,&quot;id&quot;:&quot;XXBEVSJXTJ&quot;}" data-component-name="LatexBlockToDOM"></div><p>where Q<sub>*</sub> is the orthogonal matrix from the Schur decomposition, and H is the upper-triangular Schur matrix. You can then read the eigenvalues off its diagonal. 
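</p><p>As a text-only illustration of this loop, here&#8217;s a minimal pure-Python sketch of the unshifted iteration, using classical Gram-Schmidt for the QR step. It&#8217;s a toy under stated assumptions: a small real matrix with linearly independent columns and real eigenvalues of distinct magnitude, and a fixed number of steps instead of a convergence test. The function names are hypothetical.</p>

```python
import math


def qr_gram_schmidt(a):
    """QR decomposition of a square matrix via classical Gram-Schmidt.

    Assumes the columns of `a` are linearly independent.
    """
    n = len(a)
    q_cols = []
    r = [[0.0] * n for _ in range(n)]
    for j in range(n):
        v = [a[i][j] for i in range(n)]  # j-th column of a
        for k, q in enumerate(q_cols):
            # Project the original column onto each earlier q and remove it.
            r[k][j] = sum(q[i] * a[i][j] for i in range(n))
            v = [v[i] - r[k][j] * q[i] for i in range(n)]
        r[j][j] = math.sqrt(sum(x * x for x in v))
        q_cols.append([x / r[j][j] for x in v])
    # q_cols holds columns; transpose into row-major form.
    q = [[q_cols[j][i] for j in range(n)] for i in range(n)]
    return q, r


def matmul(x, y):
    """Multiply two square matrices given as nested lists."""
    n = len(x)
    return [[sum(x[i][k] * y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def qr_eigenvalues(a, steps=50):
    """Unshifted QR iteration: A_{i+1} = R_i Q_i, then read the diagonal."""
    for _ in range(steps):
        q, r = qr_gram_schmidt(a)
        a = matmul(r, q)
    return [a[i][i] for i in range(len(a))]


print(qr_eigenvalues([[2.0, 1.0], [1.0, 2.0]]))
```

<p>For the symmetric matrix above, the diagonal settles near 3 and 1, its two eigenvalues.</p><p>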
This means you can use code like <a href="https://github.com/adamcataldo/aconai/blob/0.4/aconai/mathy/eigenvalues.py">this</a> to find the eigenvalues:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://github.com/adamcataldo/aconai/blob/0.4/aconai/mathy/eigenvalues.py" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!9UGM!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F783d0477-121a-4633-baeb-ac3d50a506ce_834x334.png 424w, https://substackcdn.com/image/fetch/$s_!9UGM!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F783d0477-121a-4633-baeb-ac3d50a506ce_834x334.png 848w, https://substackcdn.com/image/fetch/$s_!9UGM!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F783d0477-121a-4633-baeb-ac3d50a506ce_834x334.png 1272w, https://substackcdn.com/image/fetch/$s_!9UGM!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F783d0477-121a-4633-baeb-ac3d50a506ce_834x334.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!9UGM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F783d0477-121a-4633-baeb-ac3d50a506ce_834x334.png" width="834" height="334" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/783d0477-121a-4633-baeb-ac3d50a506ce_834x334.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:334,&quot;width&quot;:834,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:69483,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:&quot;https://github.com/adamcataldo/aconai/blob/0.4/aconai/mathy/eigenvalues.py&quot;,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://aconai.dev/i/164513458?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F783d0477-121a-4633-baeb-ac3d50a506ce_834x334.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!9UGM!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F783d0477-121a-4633-baeb-ac3d50a506ce_834x334.png 424w, https://substackcdn.com/image/fetch/$s_!9UGM!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F783d0477-121a-4633-baeb-ac3d50a506ce_834x334.png 848w, https://substackcdn.com/image/fetch/$s_!9UGM!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F783d0477-121a-4633-baeb-ac3d50a506ce_834x334.png 1272w, https://substackcdn.com/image/fetch/$s_!9UGM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F783d0477-121a-4633-baeb-ac3d50a506ce_834x334.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft 
icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>This is not too different from what production eigenvalue solvers do. They make a couple of key changes for better performance, however:</p><ol><li><p>Before the first step, they transform A into an <em>upper Hessenberg</em><a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-11" href="#footnote-11" target="_self">11</a> form. This is a one-time transformation that costs O(n<sup>3</sup>) time, but it then reduces the time of each QR step in the loop from O(n<sup>3</sup>) to O(n<sup>2</sup>).</p></li><li><p>At each step, they perform a <em>Wilkinson shift</em><a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-12" href="#footnote-12" target="_self">12</a>. 
This subtracts a well-chosen multiple of the identity from the matrix before each QR step and adds it back afterward, which takes the overall loop from linear convergence to quadratic convergence in the general case, and to cubic convergence when A is symmetric.</p></li></ol><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>Pearson, K. (1901). LIII. <em>On lines and planes of closest fit to systems of points in space</em>. <em>The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science</em>, <em>2</em>(11), 559&#8211;572. <a href="https://doi.org/10.1080/14786440109462720">https://doi.org/10.1080/14786440109462720</a></p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>Panda, Priyadarshini, Efstathia Soufleri, and Kaushik Roy. "Evaluating the Stability of Recurrent Neural Models during Training with Eigenvalue Spectra Analysis." <em>2019 International Joint Conference on Neural Networks (IJCNN)</em>. IEEE, 2019.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-3" href="#footnote-anchor-3" class="footnote-number" contenteditable="false" target="_self">3</a><div class="footnote-content"><p>Ghorbani, B., Krishnan, S. &amp; Xiao, Y. (2019). An Investigation into Neural Net Optimization via Hessian Eigenvalue Density. 
<em>Proceedings of the 36th International Conference on Machine Learning</em>, in <em>Proceedings of Machine Learning Research</em> 97:2232&#8211;2241. Available from <a href="https://proceedings.mlr.press/v97/ghorbani19b.html">https://proceedings.mlr.press/v97/ghorbani19b.html</a>.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-4" href="#footnote-anchor-4" class="footnote-number" contenteditable="false" target="_self">4</a><div class="footnote-content"><p>MathWorks. &#8220;Roots&#8239;&#8211;&#8239;Polynomial Roots&#8239;&#8211;&#8239;MATLAB.&#8221; <em>MATLAB Documentation</em>, MathWorks, <a href="http://www.mathworks.com/help/matlab/ref/roots.html">www.mathworks.com/help/matlab/ref/roots.html</a>. Accessed&#8239;26&#8239;May&#8239;2025.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-5" href="#footnote-anchor-5" class="footnote-number" contenteditable="false" target="_self">5</a><div class="footnote-content"><p>&#8220;Companion Matrix.&#8221; <em>Wikipedia</em>, Wikimedia Foundation, <a href="https://en.wikipedia.org/wiki/Companion_matrix">https://en.wikipedia.org/wiki/Companion_matrix</a>. Accessed 26 May 2025.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-6" href="#footnote-anchor-6" class="footnote-number" contenteditable="false" target="_self">6</a><div class="footnote-content"><p>&#8220;Matrix Similarity.&#8221; <em>Wikipedia</em>, Wikimedia Foundation, <a href="https://en.wikipedia.org/wiki/Matrix_similarity">https://en.wikipedia.org/wiki/Matrix_similarity</a>. 
Accessed 26 May 2025.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-7" href="#footnote-anchor-7" class="footnote-number" contenteditable="false" target="_self">7</a><div class="footnote-content"><p>&#8220;QR Decomposition.&#8221; <em>Wikipedia</em>, Wikimedia Foundation, 9&#8239;May&#8239;2025, <a href="https://en.wikipedia.org/wiki/QR_decomposition">https://en.wikipedia.org/wiki/QR_decomposition</a>. Accessed&#8239;26&#8239;May&#8239;2025.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-8" href="#footnote-anchor-8" class="footnote-number" contenteditable="false" target="_self">8</a><div class="footnote-content"><p>&#8220;Gram&#8211;Schmidt Process.&#8221; <em>Wikipedia</em>, Wikimedia Foundation, <a href="https://en.wikipedia.org/wiki/Gram%E2%80%93Schmidt_process">https://en.wikipedia.org/wiki/Gram%E2%80%93Schmidt_process</a>. Accessed 26 May 2025.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-9" href="#footnote-anchor-9" class="footnote-number" contenteditable="false" target="_self">9</a><div class="footnote-content"><p>&#8220;Householder Transformation.&#8221; <em>Wikipedia</em>, Wikimedia Foundation, <a href="https://en.wikipedia.org/wiki/Householder_transformation">https://en.wikipedia.org/wiki/Householder_transformation</a>. Accessed 26 May 2025.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-10" href="#footnote-anchor-10" class="footnote-number" contenteditable="false" target="_self">10</a><div class="footnote-content"><p>Falk, Richard S. &#8220;Convergence of the QR Algorithm.&#8221; <em>Math&#8239;574 Lecture Notes</em>, Rutgers U, 2004, pp.&#8239;32&#8209;33. 
<a href="https://sites.math.rutgers.edu/~falk/math574/lecture9.pdf">sites.math.rutgers.edu</a></p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-11" href="#footnote-anchor-11" class="footnote-number" contenteditable="false" target="_self">11</a><div class="footnote-content"><p>&#8220;Hessenberg Matrix.&#8221; <em>Wikipedia</em>, Wikimedia Foundation, <a href="https://en.wikipedia.org/wiki/Hessenberg_matrix">https://en.wikipedia.org/wiki/Hessenberg_matrix</a>. Accessed 26 May 2025.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-12" href="#footnote-anchor-12" class="footnote-number" contenteditable="false" target="_self">12</a><div class="footnote-content"><p>Yang, Ming-Hsuan. &#8220;Lecture 17.&#8221; <em>EECS 275: Matrix Computation</em>, University of California, Merced, <a href="https://faculty.ucmerced.edu/mhyang/course/eecs275/lectures/lecture17.pdf">https://faculty.ucmerced.edu/mhyang/course/eecs275/lectures/lecture17.pdf</a>. 
Accessed 26 May 2025.</p></div></div>]]></content:encoded></item><item><title><![CDATA[Datasets & Dataloaders & DataFrames]]></title><description><![CDATA[Oh my!]]></description><link>https://aconai.dev/p/datasets-and-dataloaders-and-dataframes</link><guid isPermaLink="false">https://aconai.dev/p/datasets-and-dataloaders-and-dataframes</guid><dc:creator><![CDATA[Adam Cataldo]]></dc:creator><pubDate>Mon, 12 May 2025 21:18:42 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!Dxuo!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61782c4a-768f-45df-9505-d4d13aa9b84b_1028x1282.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>When starting out with PyTorch<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a>, one of the first things you have to get your head around is the pair of Dataset and DataLoader classes<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a>. The names of these two classes are unintuitive, but basically:</p><ol><li><p>A Dataset is an object used to access samples from your data set.</p></li><li><p>A DataLoader is an object used to batch samples together.</p></li></ol><p>I probably would have named these classes DataAccessor and DataBatcher, to make it clearer what they do, but the key thing to remember is that you need a Dataset to tell PyTorch how to access your data, and a DataLoader to tell PyTorch how to convert samples into batches for training and inference.</p><p>To show how this works in practice, I downloaded the Intel Image Classification dataset from Kaggle<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-3" href="#footnote-3" target="_self">3</a> to my local machine. 
I created ImageDataset to read the images from my local directory:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Dxuo!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61782c4a-768f-45df-9505-d4d13aa9b84b_1028x1282.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Dxuo!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61782c4a-768f-45df-9505-d4d13aa9b84b_1028x1282.png 424w, https://substackcdn.com/image/fetch/$s_!Dxuo!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61782c4a-768f-45df-9505-d4d13aa9b84b_1028x1282.png 848w, https://substackcdn.com/image/fetch/$s_!Dxuo!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61782c4a-768f-45df-9505-d4d13aa9b84b_1028x1282.png 1272w, https://substackcdn.com/image/fetch/$s_!Dxuo!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61782c4a-768f-45df-9505-d4d13aa9b84b_1028x1282.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Dxuo!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61782c4a-768f-45df-9505-d4d13aa9b84b_1028x1282.png" width="1028" height="1282" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/61782c4a-768f-45df-9505-d4d13aa9b84b_1028x1282.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1282,&quot;width&quot;:1028,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:291422,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://aconai.dev/i/163154925?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61782c4a-768f-45df-9505-d4d13aa9b84b_1028x1282.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Dxuo!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61782c4a-768f-45df-9505-d4d13aa9b84b_1028x1282.png 424w, https://substackcdn.com/image/fetch/$s_!Dxuo!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61782c4a-768f-45df-9505-d4d13aa9b84b_1028x1282.png 848w, https://substackcdn.com/image/fetch/$s_!Dxuo!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61782c4a-768f-45df-9505-d4d13aa9b84b_1028x1282.png 1272w, https://substackcdn.com/image/fetch/$s_!Dxuo!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61782c4a-768f-45df-9505-d4d13aa9b84b_1028x1282.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p>The key methods are <code>__len__</code>, which tells PyTorch how many images I have, and <code>__getitem__</code>, which returns an image and its label. For this data set, there are only six labels, which are derived from the subdirectory that contained the image. 
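</p><p>As a rough sketch of that protocol (not the code from the screenshot above), here is a dependency-free class showing how <code>__len__</code> and <code>__getitem__</code> might work when labels come from subdirectory names. The class name <code>FolderDataset</code>, the <code>.jpg</code> extension, and the tiny temporary directory tree are assumptions for illustration. It returns file paths instead of decoded tensors; a real PyTorch version would subclass <code>torch.utils.data.Dataset</code> and load each image into a tensor inside <code>__getitem__</code>.</p><pre><code>import tempfile
from pathlib import Path


class FolderDataset:
    """Map-style dataset sketch: one subdirectory per label (hypothetical layout)."""

    def __init__(self, root):
        self.root = Path(root)
        # Sorted subdirectory names give a stable label-to-index mapping.
        self.classes = sorted(p.name for p in self.root.iterdir() if p.is_dir())
        # Each sample is a (file path, integer label) pair.
        self.samples = [
            (path, index)
            for index, name in enumerate(self.classes)
            for path in sorted((self.root / name).glob("*.jpg"))
        ]

    def __len__(self):
        # Number of samples in the data set.
        return len(self.samples)

    def __getitem__(self, i):
        # A real implementation would decode the image into a tensor here.
        return self.samples[i]


# Build a tiny fake directory tree so the sketch runs anywhere.
root = Path(tempfile.mkdtemp())
for name, count in [("forest", 2), ("sea", 1)]:
    (root / name).mkdir()
    for k in range(count):
        (root / name / f"{k}.jpg").touch()

dataset = FolderDataset(root)
print("Number of images:", len(dataset))
path, label = dataset[0]
print(path.name, dataset.classes[label])</code></pre><p>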
While not something you&#8217;d typically do in PyTorch, you can work with this <code>ImageSet</code> object directly:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ffxr!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8283ab67-7565-4486-ac2d-b38e56d5f245_674x156.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ffxr!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8283ab67-7565-4486-ac2d-b38e56d5f245_674x156.png 424w, https://substackcdn.com/image/fetch/$s_!ffxr!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8283ab67-7565-4486-ac2d-b38e56d5f245_674x156.png 848w, https://substackcdn.com/image/fetch/$s_!ffxr!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8283ab67-7565-4486-ac2d-b38e56d5f245_674x156.png 1272w, https://substackcdn.com/image/fetch/$s_!ffxr!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8283ab67-7565-4486-ac2d-b38e56d5f245_674x156.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ffxr!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8283ab67-7565-4486-ac2d-b38e56d5f245_674x156.png" width="674" height="156" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8283ab67-7565-4486-ac2d-b38e56d5f245_674x156.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:156,&quot;width&quot;:674,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:42065,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://aconai.dev/i/163154925?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8283ab67-7565-4486-ac2d-b38e56d5f245_674x156.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!ffxr!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8283ab67-7565-4486-ac2d-b38e56d5f245_674x156.png 424w, https://substackcdn.com/image/fetch/$s_!ffxr!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8283ab67-7565-4486-ac2d-b38e56d5f245_674x156.png 848w, https://substackcdn.com/image/fetch/$s_!ffxr!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8283ab67-7565-4486-ac2d-b38e56d5f245_674x156.png 1272w, https://substackcdn.com/image/fetch/$s_!ffxr!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8283ab67-7565-4486-ac2d-b38e56d5f245_674x156.png 1456w" sizes="100vw"></picture><div></div></div></a></figure></div><pre><code>Number of images: 14034
First image shape: torch.Size([3, 150, 150])</code></pre><p>Note that each image has shape 3 &#10761; 150 &#10761; 150. The first dimension is the color channel: red, green, and blue. The last two dimensions form a 150 &#10761; 150 intensity grid for each color channel in the image.</p><p>I can now wrap this in a DataLoader to produce image batches:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!LQvm!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7302a833-75c9-4d3b-85ea-8d08d2d25567_768x236.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!LQvm!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7302a833-75c9-4d3b-85ea-8d08d2d25567_768x236.png 424w, https://substackcdn.com/image/fetch/$s_!LQvm!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7302a833-75c9-4d3b-85ea-8d08d2d25567_768x236.png 848w, https://substackcdn.com/image/fetch/$s_!LQvm!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7302a833-75c9-4d3b-85ea-8d08d2d25567_768x236.png 1272w, https://substackcdn.com/image/fetch/$s_!LQvm!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7302a833-75c9-4d3b-85ea-8d08d2d25567_768x236.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!LQvm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7302a833-75c9-4d3b-85ea-8d08d2d25567_768x236.png" width="768" height="236" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7302a833-75c9-4d3b-85ea-8d08d2d25567_768x236.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:236,&quot;width&quot;:768,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:56114,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://aconai.dev/i/163154925?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7302a833-75c9-4d3b-85ea-8d08d2d25567_768x236.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!LQvm!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7302a833-75c9-4d3b-85ea-8d08d2d25567_768x236.png 424w, https://substackcdn.com/image/fetch/$s_!LQvm!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7302a833-75c9-4d3b-85ea-8d08d2d25567_768x236.png 848w, https://substackcdn.com/image/fetch/$s_!LQvm!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7302a833-75c9-4d3b-85ea-8d08d2d25567_768x236.png 1272w, https://substackcdn.com/image/fetch/$s_!LQvm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7302a833-75c9-4d3b-85ea-8d08d2d25567_768x236.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>Note that the first time I ran this, I ran into a problem:</p><pre><code>RuntimeError: stack expects each tensor to be equal size, but got [3, 150, 150] at entry 0 and [3, 103, 150] at entry 
37</code></pre><p>The problem is that there&#8217;s at least one image which is smaller than the others. I can make a simple transformation to pad all images to have width and height exactly 150:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!F-kx!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0045894a-105e-4d9b-a120-549c01de9085_972x600.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!F-kx!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0045894a-105e-4d9b-a120-549c01de9085_972x600.png 424w, https://substackcdn.com/image/fetch/$s_!F-kx!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0045894a-105e-4d9b-a120-549c01de9085_972x600.png 848w, https://substackcdn.com/image/fetch/$s_!F-kx!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0045894a-105e-4d9b-a120-549c01de9085_972x600.png 1272w, https://substackcdn.com/image/fetch/$s_!F-kx!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0045894a-105e-4d9b-a120-549c01de9085_972x600.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!F-kx!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0045894a-105e-4d9b-a120-549c01de9085_972x600.png" width="972" height="600" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0045894a-105e-4d9b-a120-549c01de9085_972x600.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:600,&quot;width&quot;:972,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:116866,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://aconai.dev/i/163154925?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0045894a-105e-4d9b-a120-549c01de9085_972x600.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!F-kx!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0045894a-105e-4d9b-a120-549c01de9085_972x600.png 424w, https://substackcdn.com/image/fetch/$s_!F-kx!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0045894a-105e-4d9b-a120-549c01de9085_972x600.png 848w, https://substackcdn.com/image/fetch/$s_!F-kx!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0045894a-105e-4d9b-a120-549c01de9085_972x600.png 1272w, https://substackcdn.com/image/fetch/$s_!F-kx!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0045894a-105e-4d9b-a120-549c01de9085_972x600.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Now instead of an error I get:</p><pre><code>Input shape: torch.Size([64, 3, 150, 150])
Label shape: torch.Size([64])
Input shape: torch.Size([64, 3, 150, 150])
Label shape: torch.Size([64])
...
Input shape: torch.Size([64, 3, 150, 150])
Label shape: torch.Size([64])
Input shape: torch.Size([18, 3, 150, 150])
Label shape: torch.Size([18])</code></pre><p>Note that the input shape for all but the last batch is 64 &#10761; 3 &#10761; 150 &#10761; 150. Here, 64 is the batch dimension: I&#8217;m batching 64 samples at a time. The very last batch is smaller, with just 18 samples, because the total number of samples (14,034) isn&#8217;t evenly divisible by 64. If this is undesirable, I can set the <code>drop_last</code> argument of DataLoader to <code>True</code> to drop the last, incomplete batch.</p><h1>And DataFrames, Oh My!</h1><p>DataFrames aren&#8217;t a thing in PyTorch; they come instead from the Pandas<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-4" href="#footnote-4" target="_self">4</a> data analysis library. A DataFrame<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-5" href="#footnote-5" target="_self">5</a> is an in-memory representation of tabular data, like what you might read from a database or CSV file. I often find myself starting with a DataFrame at the beginning of my AI pipeline. 
Because I do this so often, I&#8217;ve created a helper class called <a href="https://github.com/adamcataldo/aconai/blob/0.3/aconai/pipelines/row_accessor.py">RowAccessor</a>, which converts a DataFrame to a Dataset:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!C-q9!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3ce1bde8-8a71-428d-9f2a-43e51a68e3de_1114x574.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!C-q9!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3ce1bde8-8a71-428d-9f2a-43e51a68e3de_1114x574.png 424w, https://substackcdn.com/image/fetch/$s_!C-q9!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3ce1bde8-8a71-428d-9f2a-43e51a68e3de_1114x574.png 848w, https://substackcdn.com/image/fetch/$s_!C-q9!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3ce1bde8-8a71-428d-9f2a-43e51a68e3de_1114x574.png 1272w, https://substackcdn.com/image/fetch/$s_!C-q9!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3ce1bde8-8a71-428d-9f2a-43e51a68e3de_1114x574.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!C-q9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3ce1bde8-8a71-428d-9f2a-43e51a68e3de_1114x574.png" width="1114" height="574" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3ce1bde8-8a71-428d-9f2a-43e51a68e3de_1114x574.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:574,&quot;width&quot;:1114,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:99750,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://aconai.dev/i/163154925?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3ce1bde8-8a71-428d-9f2a-43e51a68e3de_1114x574.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!C-q9!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3ce1bde8-8a71-428d-9f2a-43e51a68e3de_1114x574.png 424w, https://substackcdn.com/image/fetch/$s_!C-q9!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3ce1bde8-8a71-428d-9f2a-43e51a68e3de_1114x574.png 848w, https://substackcdn.com/image/fetch/$s_!C-q9!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3ce1bde8-8a71-428d-9f2a-43e51a68e3de_1114x574.png 1272w, https://substackcdn.com/image/fetch/$s_!C-q9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3ce1bde8-8a71-428d-9f2a-43e51a68e3de_1114x574.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>It takes a DataFrame and a list of label columns as input:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!9gym!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6bc15ccb-b934-444d-a04a-b4f3826e1b27_612x498.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!9gym!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6bc15ccb-b934-444d-a04a-b4f3826e1b27_612x498.png 424w, 
https://substackcdn.com/image/fetch/$s_!9gym!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6bc15ccb-b934-444d-a04a-b4f3826e1b27_612x498.png 848w, https://substackcdn.com/image/fetch/$s_!9gym!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6bc15ccb-b934-444d-a04a-b4f3826e1b27_612x498.png 1272w, https://substackcdn.com/image/fetch/$s_!9gym!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6bc15ccb-b934-444d-a04a-b4f3826e1b27_612x498.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!9gym!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6bc15ccb-b934-444d-a04a-b4f3826e1b27_612x498.png" width="612" height="498" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6bc15ccb-b934-444d-a04a-b4f3826e1b27_612x498.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:498,&quot;width&quot;:612,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:84390,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://aconai.dev/i/163154925?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6bc15ccb-b934-444d-a04a-b4f3826e1b27_612x498.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!9gym!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6bc15ccb-b934-444d-a04a-b4f3826e1b27_612x498.png 424w, 
https://substackcdn.com/image/fetch/$s_!9gym!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6bc15ccb-b934-444d-a04a-b4f3826e1b27_612x498.png 848w, https://substackcdn.com/image/fetch/$s_!9gym!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6bc15ccb-b934-444d-a04a-b4f3826e1b27_612x498.png 1272w, https://substackcdn.com/image/fetch/$s_!9gym!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6bc15ccb-b934-444d-a04a-b4f3826e1b27_612x498.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><pre><code>Features: tensor([2., 5.], dtype=torch.float64)
Labels: tensor([1])</code></pre><p>If I need to preprocess a DataFrame to get it ready for PyTorch, I do this before creating the RowAccessor. So if I start with a DataFrame like this:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!K5V2!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff9364628-b11d-4747-b26f-28d52893f987_1016x276.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!K5V2!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff9364628-b11d-4747-b26f-28d52893f987_1016x276.png 424w, https://substackcdn.com/image/fetch/$s_!K5V2!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff9364628-b11d-4747-b26f-28d52893f987_1016x276.png 848w, https://substackcdn.com/image/fetch/$s_!K5V2!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff9364628-b11d-4747-b26f-28d52893f987_1016x276.png 1272w, https://substackcdn.com/image/fetch/$s_!K5V2!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff9364628-b11d-4747-b26f-28d52893f987_1016x276.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!K5V2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff9364628-b11d-4747-b26f-28d52893f987_1016x276.png" width="1016" height="276" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f9364628-b11d-4747-b26f-28d52893f987_1016x276.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:276,&quot;width&quot;:1016,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:45688,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://aconai.dev/i/163154925?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff9364628-b11d-4747-b26f-28d52893f987_1016x276.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!K5V2!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff9364628-b11d-4747-b26f-28d52893f987_1016x276.png 424w, https://substackcdn.com/image/fetch/$s_!K5V2!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff9364628-b11d-4747-b26f-28d52893f987_1016x276.png 848w, https://substackcdn.com/image/fetch/$s_!K5V2!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff9364628-b11d-4747-b26f-28d52893f987_1016x276.png 1272w, https://substackcdn.com/image/fetch/$s_!K5V2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff9364628-b11d-4747-b26f-28d52893f987_1016x276.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>And I want to make one-hot encoded<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-6" href="#footnote-6" target="_self">6</a> features for the day of the week in the <code>date</code> column and drop the <code>x2</code> column, I would do something like this:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!s50l!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6cad0ab0-adf7-40ca-8a5c-97103625fd4f_1032x246.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!s50l!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6cad0ab0-adf7-40ca-8a5c-97103625fd4f_1032x246.png 424w, https://substackcdn.com/image/fetch/$s_!s50l!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6cad0ab0-adf7-40ca-8a5c-97103625fd4f_1032x246.png 848w, https://substackcdn.com/image/fetch/$s_!s50l!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6cad0ab0-adf7-40ca-8a5c-97103625fd4f_1032x246.png 1272w, https://substackcdn.com/image/fetch/$s_!s50l!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6cad0ab0-adf7-40ca-8a5c-97103625fd4f_1032x246.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!s50l!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6cad0ab0-adf7-40ca-8a5c-97103625fd4f_1032x246.png" width="1032" height="246" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6cad0ab0-adf7-40ca-8a5c-97103625fd4f_1032x246.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:246,&quot;width&quot;:1032,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:79379,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://aconai.dev/i/163154925?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6cad0ab0-adf7-40ca-8a5c-97103625fd4f_1032x246.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!s50l!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6cad0ab0-adf7-40ca-8a5c-97103625fd4f_1032x246.png 424w, https://substackcdn.com/image/fetch/$s_!s50l!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6cad0ab0-adf7-40ca-8a5c-97103625fd4f_1032x246.png 848w, https://substackcdn.com/image/fetch/$s_!s50l!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6cad0ab0-adf7-40ca-8a5c-97103625fd4f_1032x246.png 1272w, https://substackcdn.com/image/fetch/$s_!s50l!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6cad0ab0-adf7-40ca-8a5c-97103625fd4f_1032x246.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><pre><code>   mon  tue  wed  thu  fri  sat  sun   x1  label
0  0.0  0.0  1.0  0.0  0.0  0.0  0.0  4.0      0
1  0.0  0.0  0.0  1.0  0.0  0.0  0.0  5.0      1
2  0.0  0.0  0.0  0.0  1.0  0.0  0.0  6.0      0</code></pre><p>Finally, I just pass the transformed object to RowAccessor.</p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>PyTorch Foundation. <em>PyTorch</em>. The Linux Foundation, 2025, <a href="https://pytorch.org/">https://pytorch.org/</a>. </p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>PyTorch Foundation. "Datasets &amp; DataLoaders." <em>PyTorch Tutorials</em>, The Linux Foundation, 16 Jan. 2024, <a href="https://docs.pytorch.org/tutorials/beginner/basics/data_tutorial.html">https://docs.pytorch.org/tutorials/beginner/basics/data_tutorial.html</a>.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-3" href="#footnote-anchor-3" class="footnote-number" contenteditable="false" target="_self">3</a><div class="footnote-content"><p>Bansal, Puneet. <em>Intel Image Classification</em>. Kaggle, 2018, <a href="https://www.kaggle.com/datasets/puneet6060/intel-image-classification">https://www.kaggle.com/datasets/puneet6060/intel-image-classification</a>. Accessed 8 May 2025.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-4" href="#footnote-anchor-4" class="footnote-number" contenteditable="false" target="_self">4</a><div class="footnote-content"><p>The pandas development team. <em>pandas-dev/pandas: Pandas</em>. Version 2.2.3, Zenodo, 20 Sept. 
2024, <a href="https://doi.org/10.5281/zenodo.13819579">https://doi.org/10.5281/zenodo.13819579</a>.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-5" href="#footnote-anchor-5" class="footnote-number" contenteditable="false" target="_self">5</a><div class="footnote-content"><p>"pandas.DataFrame." <em>pandas Documentation</em>, The pandas Development Team, 2025, <a href="https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html#pandas.DataFrame">https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html#pandas.DataFrame</a>.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-6" href="#footnote-anchor-6" class="footnote-number" contenteditable="false" target="_self">6</a><div class="footnote-content"><p>Brownlee, Jason. (2017). "Why One-Hot Encode Data in Machine Learning?". <em>Machinelearningmastery</em>. <a href="https://machinelearningmastery.com/why-one-hot-encode-data-in-machine-learning/">https://machinelearningmastery.com/why-one-hot-encode-data-in-machine-learning/</a></p><p></p></div></div>]]></content:encoded></item><item><title><![CDATA[Getting data]]></title><description><![CDATA[The dark underbelly of AI]]></description><link>https://aconai.dev/p/getting-data</link><guid isPermaLink="false">https://aconai.dev/p/getting-data</guid><dc:creator><![CDATA[Adam Cataldo]]></dc:creator><pubDate>Tue, 06 May 2025 18:14:39 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!YG0J!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ba5492c-1db7-4bc0-9407-6f77dac6d631_1030x690.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Have a great idea for object recognition, a music generator, or a coding agent? Step one: get the data to train your model. I hate step one. 
I hate it so much, I go to <a href="https://aconai.dev/p/safely-storing-input-data-for-ai?r=5k7cbx">great lengths</a> to make sure I only need to get the data once. Unless you&#8217;re working with data you already have, expect pain.</p><p>In this post, I walk through an &#8220;easy case&#8221; of getting data: downloading security price data from Yahoo! Finance<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a>. In this case, there&#8217;s a good library for the task at hand: yfinance<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a>. yfinance makes API calls to Yahoo! to get CSV price data, and returns them as Pandas DataFrames<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-3" href="#footnote-3" target="_self">3</a>. As you&#8217;ll see, even this easy case has several gotchas you need to be aware of.</p><p>The main method for price data is the <code>download</code> method:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!PpCi!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F04c01287-55dd-447b-9baf-9a24826dac5e_866x122.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!PpCi!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F04c01287-55dd-447b-9baf-9a24826dac5e_866x122.png 424w, https://substackcdn.com/image/fetch/$s_!PpCi!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F04c01287-55dd-447b-9baf-9a24826dac5e_866x122.png 848w, 
https://substackcdn.com/image/fetch/$s_!PpCi!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F04c01287-55dd-447b-9baf-9a24826dac5e_866x122.png 1272w, https://substackcdn.com/image/fetch/$s_!PpCi!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F04c01287-55dd-447b-9baf-9a24826dac5e_866x122.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!PpCi!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F04c01287-55dd-447b-9baf-9a24826dac5e_866x122.png" width="866" height="122" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/04c01287-55dd-447b-9baf-9a24826dac5e_866x122.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:122,&quot;width&quot;:866,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:24850,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://aconai.dev/i/162417057?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F04c01287-55dd-447b-9baf-9a24826dac5e_866x122.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!PpCi!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F04c01287-55dd-447b-9baf-9a24826dac5e_866x122.png 424w, https://substackcdn.com/image/fetch/$s_!PpCi!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F04c01287-55dd-447b-9baf-9a24826dac5e_866x122.png 848w, 
https://substackcdn.com/image/fetch/$s_!PpCi!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F04c01287-55dd-447b-9baf-9a24826dac5e_866x122.png 1272w, https://substackcdn.com/image/fetch/$s_!PpCi!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F04c01287-55dd-447b-9baf-9a24826dac5e_866x122.png 1456w" sizes="100vw" fetchpriority="high"></picture><div></div></div></a></figure></div><h1>Gotcha 1: Yahoo! broke yfinance</h1><p>Note that when I first ran this method for this post, I ran into this issue:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Y-dD!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7644a2ac-731c-48cc-a8f5-850a477d40b8_1198x196.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Y-dD!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7644a2ac-731c-48cc-a8f5-850a477d40b8_1198x196.png 424w, https://substackcdn.com/image/fetch/$s_!Y-dD!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7644a2ac-731c-48cc-a8f5-850a477d40b8_1198x196.png 848w, https://substackcdn.com/image/fetch/$s_!Y-dD!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7644a2ac-731c-48cc-a8f5-850a477d40b8_1198x196.png 1272w, https://substackcdn.com/image/fetch/$s_!Y-dD!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7644a2ac-731c-48cc-a8f5-850a477d40b8_1198x196.png 1456w" 
sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Y-dD!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7644a2ac-731c-48cc-a8f5-850a477d40b8_1198x196.png" width="1198" height="196" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7644a2ac-731c-48cc-a8f5-850a477d40b8_1198x196.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:196,&quot;width&quot;:1198,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:41304,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://aconai.dev/i/162417057?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7644a2ac-731c-48cc-a8f5-850a477d40b8_1198x196.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Y-dD!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7644a2ac-731c-48cc-a8f5-850a477d40b8_1198x196.png 424w, https://substackcdn.com/image/fetch/$s_!Y-dD!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7644a2ac-731c-48cc-a8f5-850a477d40b8_1198x196.png 848w, https://substackcdn.com/image/fetch/$s_!Y-dD!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7644a2ac-731c-48cc-a8f5-850a477d40b8_1198x196.png 1272w, https://substackcdn.com/image/fetch/$s_!Y-dD!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7644a2ac-731c-48cc-a8f5-850a477d40b8_1198x196.png 1456w" 
sizes="100vw"></picture><div></div></div></a></figure></div><p>I&#8217;ve seen a similar issue before, with a previous version of the library. I did some searching and found <a href="https://github.com/ranaroussi/yfinance/issues/2422">this recent bug</a>, which explained what was going on. In a nutshell, Yahoo! stopped accepting requests with the user agent string that yfinance was providing. The error says I&#8217;ve been rate limited, but that&#8217;s a lie; Yahoo!&#8217;s servers were simply rejecting my requests. With this workaround:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!XCRn!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F227883c8-cd34-4e35-a13d-cc49d27aa415_1092x122.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!XCRn!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F227883c8-cd34-4e35-a13d-cc49d27aa415_1092x122.png 424w, https://substackcdn.com/image/fetch/$s_!XCRn!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F227883c8-cd34-4e35-a13d-cc49d27aa415_1092x122.png 848w, https://substackcdn.com/image/fetch/$s_!XCRn!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F227883c8-cd34-4e35-a13d-cc49d27aa415_1092x122.png 1272w, https://substackcdn.com/image/fetch/$s_!XCRn!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F227883c8-cd34-4e35-a13d-cc49d27aa415_1092x122.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!XCRn!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F227883c8-cd34-4e35-a13d-cc49d27aa415_1092x122.png" width="1092" height="122" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/227883c8-cd34-4e35-a13d-cc49d27aa415_1092x122.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:122,&quot;width&quot;:1092,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:43080,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://aconai.dev/i/162417057?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F227883c8-cd34-4e35-a13d-cc49d27aa415_1092x122.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!XCRn!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F227883c8-cd34-4e35-a13d-cc49d27aa415_1092x122.png 424w, https://substackcdn.com/image/fetch/$s_!XCRn!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F227883c8-cd34-4e35-a13d-cc49d27aa415_1092x122.png 848w, https://substackcdn.com/image/fetch/$s_!XCRn!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F227883c8-cd34-4e35-a13d-cc49d27aa415_1092x122.png 1272w, https://substackcdn.com/image/fetch/$s_!XCRn!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F227883c8-cd34-4e35-a13d-cc49d27aa415_1092x122.png 1456w" 
sizes="100vw"></picture><div></div></div></a></figure></div><p>I now get some reasonable-looking data:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!YG0J!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ba5492c-1db7-4bc0-9407-6f77dac6d631_1030x690.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!YG0J!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ba5492c-1db7-4bc0-9407-6f77dac6d631_1030x690.png 424w, https://substackcdn.com/image/fetch/$s_!YG0J!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ba5492c-1db7-4bc0-9407-6f77dac6d631_1030x690.png 848w, https://substackcdn.com/image/fetch/$s_!YG0J!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ba5492c-1db7-4bc0-9407-6f77dac6d631_1030x690.png 1272w, https://substackcdn.com/image/fetch/$s_!YG0J!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ba5492c-1db7-4bc0-9407-6f77dac6d631_1030x690.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!YG0J!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ba5492c-1db7-4bc0-9407-6f77dac6d631_1030x690.png" width="1030" height="690" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5ba5492c-1db7-4bc0-9407-6f77dac6d631_1030x690.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:690,&quot;width&quot;:1030,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:134209,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://aconai.dev/i/162417057?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ba5492c-1db7-4bc0-9407-6f77dac6d631_1030x690.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!YG0J!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ba5492c-1db7-4bc0-9407-6f77dac6d631_1030x690.png 424w, https://substackcdn.com/image/fetch/$s_!YG0J!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ba5492c-1db7-4bc0-9407-6f77dac6d631_1030x690.png 848w, https://substackcdn.com/image/fetch/$s_!YG0J!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ba5492c-1db7-4bc0-9407-6f77dac6d631_1030x690.png 1272w, https://substackcdn.com/image/fetch/$s_!YG0J!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ba5492c-1db7-4bc0-9407-6f77dac6d631_1030x690.png 1456w" sizes="100vw" loading="lazy"></picture>
</div></a></figure></div><p>Note: the yfinance maintainers will likely fix this bug in a future release, but I&#8217;ve seen similar issues working with this API in the past. It&#8217;s a bit of a game of Whac-A-Mole, since Yahoo! makes frequent changes, some of which cause the yfinance library to break. There are paid services for getting price data that are more reliable and don&#8217;t depend on Yahoo! Finance, like marketstack<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-4" href="#footnote-4" target="_self">4</a>, Alpha Vantage<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-5" href="#footnote-5" target="_self">5</a>, and Polygon.io<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-6" href="#footnote-6" target="_self">6</a>. 
That said, yfinance is free, and these services can be a bit pricey, especially if you don&#8217;t use them frequently.</p><h1>Gotcha 2: Actual rate limiting</h1><p>Allegedly, Yahoo! Finance does actually rate limit requests. yfinance documents how to reduce the risk of being rate limited by using local caches.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-7" href="#footnote-7" target="_self">7</a> Unfortunately, there&#8217;s no documentation about what rate limits are actually applied. Yahoo! does mention rate limiting in its legal documents,<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-8" href="#footnote-8" target="_self">8</a> but it&#8217;s not clear whether this is just to scare people away from abusing the service or whether the limits are actually enforced.</p><h1>Gotcha 3: Adjusted prices</h1><p>The next issue is which data gets served. By default, all prices served by this library are adjusted to account for dividends and splits. For example, if a $21 stock pays a $1 dividend, and there is no other price movement, the stock price should go down to $20. The investor didn&#8217;t lose any money though, since they could just reinvest their dividend back in the stock. Adjusted prices account for this, so you don&#8217;t see big discontinuities in the stock price when dividends or splits happen. In this example, if the $20 price was today&#8217;s price after the $1 dividend was paid, prices before the dividend date would get multiplied by (20/21). These adjustments stack up to account for multiple dividends and other corporate actions, like splits, that change the price. 
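</p><p>To make the arithmetic concrete, here is a minimal sketch of back-adjusting closing prices for dividends. It assumes one common convention, where each dividend scales every earlier close by (previous close - dividend) / previous close; Yahoo!&#8217;s exact calculation may differ. The function name <code>back_adjust</code> and the sample prices are hypothetical, chosen to match the $21 example.</p><pre><code># Illustrative sketch only: back-adjust closing prices for cash dividends.
# Assumption: each dividend scales every earlier close by
# (previous close - dividend) / previous close. Yahoo! may compute it differently.


def back_adjust(closes, dividends):
    """Return dividend-adjusted closes; both lists are ordered oldest first."""
    adjusted = list(closes)
    factor = 1.0  # the most recent prices are left unchanged
    # Walk backwards in time so factors from later dividends stack up.
    for i in range(len(closes) - 1, -1, -1):
        adjusted[i] = closes[i] * factor
        # dividends[i] is the amount paid on date i (0.0 if none).
        if dividends[i] != 0.0 and i != 0:
            prev_close = closes[i - 1]
            factor *= (prev_close - dividends[i]) / prev_close
    return adjusted


# Hypothetical data matching the example: a $21 stock pays a $1 dividend.
closes = [21.0, 21.0, 20.0, 20.0]
dividends = [0.0, 0.0, 1.0, 0.0]
print(back_adjust(closes, dividends))
</code></pre><p>With these inputs, the two closes before the dividend are scaled by 20/21, so the adjusted series stays flat at about $20 instead of showing a $1 drop.</p><p>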
For many analyses, adjusted prices are what you want, but they&#8217;re not if you want to look at a signal like the dividend-to-price ratio, since the adjusted historic price may be different from the actual historic price.</p><p>On that note, the library can return dividends and splits, and there&#8217;s a parameter for getting the price data unadjusted. I personally think it&#8217;s best to get all the data, and then I can decide which columns I care about later. There&#8217;s no way to get all unadjusted and adjusted data in a single call, but I found a workaround: download the unadjusted data, which already includes the adjusted close prices, and then use the adjusted close prices to adjust the open, high, and low prices. Putting it all together, I end up with:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!_uVr!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a5f1412-aca8-470e-b50c-c41c86fa7582_754x592.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!_uVr!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a5f1412-aca8-470e-b50c-c41c86fa7582_754x592.png 424w, https://substackcdn.com/image/fetch/$s_!_uVr!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a5f1412-aca8-470e-b50c-c41c86fa7582_754x592.png 848w, https://substackcdn.com/image/fetch/$s_!_uVr!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a5f1412-aca8-470e-b50c-c41c86fa7582_754x592.png 1272w, 
https://substackcdn.com/image/fetch/$s_!_uVr!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a5f1412-aca8-470e-b50c-c41c86fa7582_754x592.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!_uVr!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a5f1412-aca8-470e-b50c-c41c86fa7582_754x592.png" width="754" height="592" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4a5f1412-aca8-470e-b50c-c41c86fa7582_754x592.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:592,&quot;width&quot;:754,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:100537,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://aconai.dev/i/162417057?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a5f1412-aca8-470e-b50c-c41c86fa7582_754x592.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!_uVr!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a5f1412-aca8-470e-b50c-c41c86fa7582_754x592.png 424w, https://substackcdn.com/image/fetch/$s_!_uVr!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a5f1412-aca8-470e-b50c-c41c86fa7582_754x592.png 848w, https://substackcdn.com/image/fetch/$s_!_uVr!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a5f1412-aca8-470e-b50c-c41c86fa7582_754x592.png 1272w, 
https://substackcdn.com/image/fetch/$s_!_uVr!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a5f1412-aca8-470e-b50c-c41c86fa7582_754x592.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><h1>Data transformation</h1><p>Once you download the data, you need a format for storing it and working with it later. As I mentioned in my <a href="https://aconai.dev/p/safely-storing-input-data-for-ai">last post</a>, I store the data using Avro. Avro is a serialization format, though, so it&#8217;s not very helpful for working with the data. 
DataFrames are easier to work with than the dicts returned from reading Avro files, so I have one step that converts the DataFrame I downloaded from yfinance into an Avro-writable dict:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!l15D!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Febb0d4ba-f69a-44ce-b7ee-54406974f15f_664x660.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!l15D!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Febb0d4ba-f69a-44ce-b7ee-54406974f15f_664x660.png 424w, https://substackcdn.com/image/fetch/$s_!l15D!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Febb0d4ba-f69a-44ce-b7ee-54406974f15f_664x660.png 848w, https://substackcdn.com/image/fetch/$s_!l15D!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Febb0d4ba-f69a-44ce-b7ee-54406974f15f_664x660.png 1272w, https://substackcdn.com/image/fetch/$s_!l15D!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Febb0d4ba-f69a-44ce-b7ee-54406974f15f_664x660.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!l15D!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Febb0d4ba-f69a-44ce-b7ee-54406974f15f_664x660.png" width="664" height="660" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ebb0d4ba-f69a-44ce-b7ee-54406974f15f_664x660.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:660,&quot;width&quot;:664,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:126951,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://aconai.dev/i/162417057?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Febb0d4ba-f69a-44ce-b7ee-54406974f15f_664x660.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!l15D!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Febb0d4ba-f69a-44ce-b7ee-54406974f15f_664x660.png 424w, https://substackcdn.com/image/fetch/$s_!l15D!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Febb0d4ba-f69a-44ce-b7ee-54406974f15f_664x660.png 848w, https://substackcdn.com/image/fetch/$s_!l15D!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Febb0d4ba-f69a-44ce-b7ee-54406974f15f_664x660.png 1272w, https://substackcdn.com/image/fetch/$s_!l15D!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Febb0d4ba-f69a-44ce-b7ee-54406974f15f_664x660.png 1456w" sizes="100vw" loading="lazy"></picture>
</div></a></figure></div><p>And I have another method that converts the data I cached in the Avro file into a DataFrame:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!OAKz!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F91af6003-0269-4378-a14a-98465c80239d_814x224.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!OAKz!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F91af6003-0269-4378-a14a-98465c80239d_814x224.png 424w, 
https://substackcdn.com/image/fetch/$s_!OAKz!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F91af6003-0269-4378-a14a-98465c80239d_814x224.png 848w, https://substackcdn.com/image/fetch/$s_!OAKz!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F91af6003-0269-4378-a14a-98465c80239d_814x224.png 1272w, https://substackcdn.com/image/fetch/$s_!OAKz!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F91af6003-0269-4378-a14a-98465c80239d_814x224.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!OAKz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F91af6003-0269-4378-a14a-98465c80239d_814x224.png" width="814" height="224" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/91af6003-0269-4378-a14a-98465c80239d_814x224.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:224,&quot;width&quot;:814,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:42558,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://aconai.dev/i/162417057?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F91af6003-0269-4378-a14a-98465c80239d_814x224.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!OAKz!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F91af6003-0269-4378-a14a-98465c80239d_814x224.png 424w, 
https://substackcdn.com/image/fetch/$s_!OAKz!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F91af6003-0269-4378-a14a-98465c80239d_814x224.png 848w, https://substackcdn.com/image/fetch/$s_!OAKz!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F91af6003-0269-4378-a14a-98465c80239d_814x224.png 1272w, https://substackcdn.com/image/fetch/$s_!OAKz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F91af6003-0269-4378-a14a-98465c80239d_814x224.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>Note that the final DataFrame is slightly easier to work with than the one I got from yfinance, because the column names are all valid Python identifiers, so I can use accessors like <code>df.adj_close</code> to access the adjusted close prices. Also, I made the dates a column, rather than an index, so <code>df.date</code> returns a column of Python date<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-9" href="#footnote-9" target="_self">9</a> objects. Finally, I picked the format of my Avro schema to make it easy to extract a DataFrame when reading the input.</p><h1>My conclusions</h1><p>As far as data downloads go, this was a fairly mild amount of pain. Still, even in this simple case, I hit several snafus, and it wasn&#8217;t the most fun programming exercise. 
I was also lucky to avoid several other common data collection problems. For example, I didn&#8217;t need to do any web scraping, which is typically a ton of work to build and a ton of work to maintain as the underlying website evolves. Also, APIs are often poorly documented, and it&#8217;s not always clear what they return.</p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p><em>Yahoo! Finance</em>. Yahoo!, 2025, <a href="https://finance.yahoo.com/">https://finance.yahoo.com/</a>.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>Aroussi, Ran. <em>yfinance documentation</em>. 2025, <a href="https://yfinance-python.org/">https://yfinance-python.org/</a>.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-3" href="#footnote-anchor-3" class="footnote-number" contenteditable="false" target="_self">3</a><div class="footnote-content"><p>The pandas development team. <em>pandas.DataFrame</em>. pandas Documentation, pandas, <a href="https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html">https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html</a>. Accessed 6 May 2025.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-4" href="#footnote-anchor-4" class="footnote-number" contenteditable="false" target="_self">4</a><div class="footnote-content"><p>marketstack. 
<em>marketstack</em>, 2025, <a href="https://marketstack.com/">https://marketstack.com/</a>.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-5" href="#footnote-anchor-5" class="footnote-number" contenteditable="false" target="_self">5</a><div class="footnote-content"><p>Alpha Vantage Inc. Alpha Vantage, 2025, <a href="https://www.alphavantage.co/">https://www.alphavantage.co/</a>. </p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-6" href="#footnote-anchor-6" class="footnote-number" contenteditable="false" target="_self">6</a><div class="footnote-content"><p>Polygon.io. Polygon.io, 2025, <a href="https://polygon.io/">https://polygon.io/</a>.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-7" href="#footnote-anchor-7" class="footnote-number" contenteditable="false" target="_self">7</a><div class="footnote-content"><p>Aroussi, Ran. <em>Caching.</em> yfinance, 2025, <a href="https://yfinance-python.org/advanced/caching.html">https://yfinance-python.org/advanced/caching.html</a>.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-8" href="#footnote-anchor-8" class="footnote-number" contenteditable="false" target="_self">8</a><div class="footnote-content"><p>Yahoo. <em>Yahoo Developer API Terms of Use</em>. 2025, <a href="https://legal.yahoo.com/us/en/yahoo/terms/product-atos/apiforydn/index.html">https://legal.yahoo.com/us/en/yahoo/terms/product-atos/apiforydn/index.html</a>.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-9" href="#footnote-anchor-9" class="footnote-number" contenteditable="false" target="_self">9</a><div class="footnote-content"><p>Python Software Foundation. <em>datetime &#8212; Basic Date and Time Types</em>. 
Python 3.13.3 Documentation, <a href="https://docs.python.org/3/library/datetime.html">https://docs.python.org/3/library/datetime.html</a>.</p></div></div>]]></content:encoded></item><item><title><![CDATA[Safely storing input data for AI pipelines]]></title><description><![CDATA[When I run AI pipelines, I often pull data from external sources, whether that&#8217;s from a proper data service, or data I scraped from a website.]]></description><link>https://aconai.dev/p/safely-storing-input-data-for-ai</link><guid isPermaLink="false">https://aconai.dev/p/safely-storing-input-data-for-ai</guid><dc:creator><![CDATA[Adam Cataldo]]></dc:creator><pubDate>Tue, 29 Apr 2025 21:06:52 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!iG0U!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c961327-bf4d-49a1-9555-6ca7930adfdb_1164x680.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>When I run AI pipelines, I often pull data from external sources, whether that&#8217;s from a proper data service, or data I scraped from a website. I&#8217;m a control freak, and this makes me nervous, because it adds a big dependency on an external source I have no control over. There&#8217;s no guarantee that if I try to rerun the same pipeline, the underlying data will still be available. Even if it is, there&#8217;s a risk that the data it returns tomorrow will be different from the data it returned today. I want my own copy of the data, and when I make tweaks to the pipeline, I want to use my own copy.</p><p>This basic idea of a <em>materialized view</em><a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a> is old, and it comes with some complications. There are many ways to achieve this, and a good place to start is to think through the requirements. 
In my case, I&#8217;m the only person accessing this data, so for me:</p><ol><li><p>I don&#8217;t need to access the data frequently. I may go days or longer between running a given pipeline.</p></li><li><p>I&#8217;m only using this data for myself, in a single process, so I don&#8217;t need to worry about concurrency problems like reads happening before writes complete.</p></li><li><p>After I&#8217;ve stored the data, I want to be able to access it even if my machine dies.</p></li><li><p>I want the materialized format to be consistent for all the data I save. By that I mean I don&#8217;t want to store the data in CSV for some cases, JSON for others, and so on.</p></li><li><p>I&#8217;d like the materialized data to be compact, to keep my storage costs low.</p></li><li><p>I want to &#8220;remember&#8221; which parameters I used to pull the data, in case I want different variations of the same data for different pipelines.</p></li><li><p>I&#8217;d like to be able to survive losing my machine without losing the data.</p></li><li><p>If for some reason I lose the materialized data, I want to be able to rematerialize it as long as the underlying data source is still available, because it&#8217;s better than nothing.</p></li><li><p>I&#8217;d like some ability to evolve my approach, if I do start bringing in other collaborators.</p></li></ol><p>Because of the infrequent access, files are a nice way to store the data. I could consider a more &#8220;proper&#8221; database solution, but that adds unnecessary overhead, like the need to have a process running that does nothing most of the time. Also, versioning is a non-concern, since I really want the data to be immutable.</p><p>Avro<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a> makes a nice choice of file format. Like CSV, it can be used to store tabular data, and like JSON, it can also be used to store hierarchical data. 
Avro&#8217;s binary format is more compact than JSON and CSV, which helps keep my storage costs low. Protocol Buffers<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-3" href="#footnote-3" target="_self">3</a> or Thrift<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-4" href="#footnote-4" target="_self">4</a> could be used in place of Avro. I chose Avro, since I&#8217;m working with Python, and since Avro doesn&#8217;t require code generation, which is fairly clunky when working in a dynamically typed language like Python.</p><p>It&#8217;s important that the stored data is safe against a crash on my local machine. Because I have an iCloud account and I&#8217;m running on a Mac, I&#8217;m currently just storing the files in iCloud<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-5" href="#footnote-5" target="_self">5</a>. I could just as easily store the files in another distributed file storage system, like Amazon S3<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-6" href="#footnote-6" target="_self">6</a>. The astute reader might notice that storing my files with a third party violates the whole motivation I had in the first place, which was to reduce the third-party dependency. While this is technically true, my data is realistically safer in a cloud file system than on my local machine, since these systems implement redundancy. This is a risk I&#8217;m willing to accept.</p><p>One thing I need is a way to keep track of all the files I&#8217;ve stored locally, and what parameters I used to grab the data. For this, I create a &#8220;registry&#8221; to track:</p><ol><li><p>The data source. I use a unique key to index the type of data I&#8217;m loading. 
If I&#8217;m downloading temperature data from the National Weather Service, this key might be something like <code>temperature.nws</code>.</p></li><li><p>Parameters. The parameters will be data-source dependent. For the temperature data, for instance, the parameters might be <code>start_date</code>, <code>end_date</code>, and <code>city</code>.</p></li><li><p>File location. This is where the data is actually stored.</p></li></ol><p>I store the entire registry in a JSON file. JSON affords me the same flexibility as Avro, and is easier to debug if something goes wrong. Since I expect the metadata in the registry to be small relative to the data stored in Avro files, I&#8217;m not worried about the extra storage overhead JSON adds compared with a more compact format.</p><h1>Implementation</h1><p>I created a <a href="http://github.com/adamcataldo/aconai/blob/0.1/aconai/pipelines/data_registry.py">DataRegistry</a> class that&#8217;s responsible for managing the registry. It has two main methods, <code>register</code> and <code>mark_written</code>. 
The <code>register</code> method is used to register a file for storage, or find a file if one has already been written:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://github.com/adamcataldo/aconai/blob/0.1/aconai/pipelines/data_registry.py#L116" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!iG0U!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c961327-bf4d-49a1-9555-6ca7930adfdb_1164x680.png 424w, https://substackcdn.com/image/fetch/$s_!iG0U!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c961327-bf4d-49a1-9555-6ca7930adfdb_1164x680.png 848w, https://substackcdn.com/image/fetch/$s_!iG0U!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c961327-bf4d-49a1-9555-6ca7930adfdb_1164x680.png 1272w, https://substackcdn.com/image/fetch/$s_!iG0U!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c961327-bf4d-49a1-9555-6ca7930adfdb_1164x680.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!iG0U!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c961327-bf4d-49a1-9555-6ca7930adfdb_1164x680.png" width="1164" height="680" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8c961327-bf4d-49a1-9555-6ca7930adfdb_1164x680.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:680,&quot;width&quot;:1164,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:155414,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:&quot;https://github.com/adamcataldo/aconai/blob/0.1/aconai/pipelines/data_registry.py#L116&quot;,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://aconai.dev/i/162213420?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c961327-bf4d-49a1-9555-6ca7930adfdb_1164x680.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!iG0U!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c961327-bf4d-49a1-9555-6ca7930adfdb_1164x680.png 424w, https://substackcdn.com/image/fetch/$s_!iG0U!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c961327-bf4d-49a1-9555-6ca7930adfdb_1164x680.png 848w, https://substackcdn.com/image/fetch/$s_!iG0U!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c961327-bf4d-49a1-9555-6ca7930adfdb_1164x680.png 1272w, https://substackcdn.com/image/fetch/$s_!iG0U!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c961327-bf4d-49a1-9555-6ca7930adfdb_1164x680.png 1456w" sizes="100vw" loading="lazy"></picture>
</div></a></figure></div><p>The <code>mark_written</code> method is used to notify the registry that the file has been successfully written to the given location:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!CjO1!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9242c124-faa6-445c-bb44-8aa279b256ee_1104x396.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!CjO1!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9242c124-faa6-445c-bb44-8aa279b256ee_1104x396.png 424w, https://substackcdn.com/image/fetch/$s_!CjO1!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9242c124-faa6-445c-bb44-8aa279b256ee_1104x396.png 848w, https://substackcdn.com/image/fetch/$s_!CjO1!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9242c124-faa6-445c-bb44-8aa279b256ee_1104x396.png 1272w, https://substackcdn.com/image/fetch/$s_!CjO1!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9242c124-faa6-445c-bb44-8aa279b256ee_1104x396.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!CjO1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9242c124-faa6-445c-bb44-8aa279b256ee_1104x396.png" width="1104" height="396" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9242c124-faa6-445c-bb44-8aa279b256ee_1104x396.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:396,&quot;width&quot;:1104,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:69053,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://aconai.dev/i/162213420?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9242c124-faa6-445c-bb44-8aa279b256ee_1104x396.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!CjO1!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9242c124-faa6-445c-bb44-8aa279b256ee_1104x396.png 424w, https://substackcdn.com/image/fetch/$s_!CjO1!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9242c124-faa6-445c-bb44-8aa279b256ee_1104x396.png 848w, https://substackcdn.com/image/fetch/$s_!CjO1!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9242c124-faa6-445c-bb44-8aa279b256ee_1104x396.png 1272w, https://substackcdn.com/image/fetch/$s_!CjO1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9242c124-faa6-445c-bb44-8aa279b256ee_1104x396.png 1456w" sizes="100vw" loading="lazy"></picture>
</div></a></figure></div><p>The reason for separating these two steps is to handle failures when trying to write data. A client of the registry:</p><ol><li><p>Calls <code>register</code> to see if the file has already been written for the given key, schema, and data-provider-specific parameters.</p></li><li><p>If the file has not been written, the client now owns the right to write the file.</p></li><li><p>The client writes the file, then updates the registry to mark the file as written. This way, later clients know whether they can safely read the data from the cache.</p></li></ol><p>This logic is a bit subtle. To avoid having to think about it more than once, I created an abstract helper class <a href="https://github.com/adamcataldo/aconai/blob/0.1/aconai/pipelines/data_provider.py">DataProvider</a>, which has a concrete method called <code>cached_read</code> that retrieves the records from the local cache if they exist, or fetches them and adds them to the local cache if they don&#8217;t. Concrete subclasses only need to specify how to retrieve records for the first time. 
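</p><p>The register-then-mark-written flow can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the code from my repo: <code>SketchRegistry</code>, the standalone <code>cached_read</code> function, the <code>fetch</code> callback, and the file-naming scheme are all hypothetical, and the sketch writes JSON where the real provider writes Avro, just to keep the example free of dependencies.</p>

```python
import json
from pathlib import Path


class SketchRegistry:
    """Hypothetical stand-in for DataRegistry; not the code from the repo."""

    def __init__(self, root):
        self.root = Path(root)
        self.index_path = self.root / "registry.json"

    def _load(self):
        # The whole registry lives in one JSON file, as described above.
        if self.index_path.exists():
            return json.loads(self.index_path.read_text())
        return {}

    def _save(self, index):
        self.index_path.write_text(json.dumps(index, indent=2))

    @staticmethod
    def _entry_id(key, params):
        # Sorting keys makes the same parameters map to the same entry.
        return key + "?" + json.dumps(params, sort_keys=True)

    def register(self, key, params):
        """Return (path, already_written) for this key and parameters."""
        index = self._load()
        entry_id = self._entry_id(key, params)
        if entry_id not in index:
            # Assumed naming scheme: one numbered file per registry entry.
            name = f"{key}.{len(index)}.json"
            index[entry_id] = {"path": name, "written": False}
            self._save(index)
        entry = index[entry_id]
        return self.root / entry["path"], entry["written"]

    def mark_written(self, key, params):
        """Record that the file for this entry was written successfully."""
        index = self._load()
        index[self._entry_id(key, params)]["written"] = True
        self._save(index)


def cached_read(registry, key, params, fetch):
    """Read records from the cache, or fetch and cache them on a miss."""
    path, written = registry.register(key, params)
    if written:
        return json.loads(path.read_text())
    records = list(fetch())  # fetch may return any iterable of records
    path.write_text(json.dumps(records))  # the real code writes Avro here
    registry.mark_written(key, params)
    return records
```

<p>If the write fails partway through, the entry stays marked as unwritten, so the next call to <code>cached_read</code> fetches and writes the data again instead of reading a partial file.</p><p>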
This is an example DataProvider subclass:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!xv6S!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1744b01-dfea-4227-97c4-e41f794c27a9_732x656.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!xv6S!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1744b01-dfea-4227-97c4-e41f794c27a9_732x656.png 424w, https://substackcdn.com/image/fetch/$s_!xv6S!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1744b01-dfea-4227-97c4-e41f794c27a9_732x656.png 848w, https://substackcdn.com/image/fetch/$s_!xv6S!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1744b01-dfea-4227-97c4-e41f794c27a9_732x656.png 1272w, https://substackcdn.com/image/fetch/$s_!xv6S!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1744b01-dfea-4227-97c4-e41f794c27a9_732x656.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!xv6S!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1744b01-dfea-4227-97c4-e41f794c27a9_732x656.png" width="732" height="656" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c1744b01-dfea-4227-97c4-e41f794c27a9_732x656.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:656,&quot;width&quot;:732,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:83554,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://aconai.dev/i/162213420?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1744b01-dfea-4227-97c4-e41f794c27a9_732x656.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!xv6S!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1744b01-dfea-4227-97c4-e41f794c27a9_732x656.png 424w, https://substackcdn.com/image/fetch/$s_!xv6S!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1744b01-dfea-4227-97c4-e41f794c27a9_732x656.png 848w, https://substackcdn.com/image/fetch/$s_!xv6S!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1744b01-dfea-4227-97c4-e41f794c27a9_732x656.png 1272w, https://substackcdn.com/image/fetch/$s_!xv6S!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1744b01-dfea-4227-97c4-e41f794c27a9_732x656.png 1456w" sizes="100vw" loading="lazy"></picture>
</div></a></figure></div><p><code>get_parameters</code> would return something non-trivial if the same data could be retrieved with different parameters, like different time stamps. <code>get_records</code> can return any iterator; it doesn&#8217;t have to be a list. This can be useful for retrieving large data sets, where it may not be possible to store the entire data set in memory at once.</p><p>The full code for this post can be found in <a href="https://github.com/adamcataldo/aconai/tree/0.1">my GitHub repo</a>.</p><h1>Scaling up to multiple users</h1><p>As I mentioned, this approach was built just for me. Some elements of this design can scale up fairly easily to multiple users. 
First, as long as the &#8220;local&#8221; storage is a distributed file system shared by all users, there&#8217;s no problem having multiple users read from the stored data.</p><p>Scaling the registry to multiple users is a bit harder. Rather than storing the entire registry in a single JSON file, it would scale better to have one registry document per key, to limit access collisions. A document database like MongoDB<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-7" href="#footnote-7" target="_self">7</a> can work well for this scenario, since it can provide read-write locking on a per-key basis out of the box. At the cost of running a document database, the solution generalizes fairly well to a multi-user scenario. A distributed lock is also needed, to handle the case where two writers try to write to the same file at the same time.</p><p>Finally, because the underlying data is stored in Avro, I have the flexibility to work with teammates who use languages other than Python. Most major programming languages have a library to read and write Avro files.</p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>Frank Wm. Tompa, Jos&#233; A. 
Blakeley, <em>Maintaining materialized views without accessing base data</em>, Information Systems, Volume 13, Issue 4, 1988, Pages 393-406, ISSN 0306-4379, <a href="https://doi.org/10.1016/0306-4379(88)90005-1">https://doi.org/10.1016/0306-4379(88)90005-1</a>.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>Apache Avro. April 18, 2025. <em>Apache Avro&#8482; 1.12.0 Documentation</em>. Apache Avro. <a href="https://avro.apache.org/docs/1.12.0/">https://avro.apache.org/docs/1.12.0/</a></p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-3" href="#footnote-anchor-3" class="footnote-number" contenteditable="false" target="_self">3</a><div class="footnote-content"><p>Protocol Buffers. 2025. <em>Protocol Buffers Documentation</em>. Protocol Buffers. <a href="https://protobuf.dev/">https://protobuf.dev/</a></p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-4" href="#footnote-anchor-4" class="footnote-number" contenteditable="false" target="_self">4</a><div class="footnote-content"><p>Apache Thrift. <em>Apache Thrift Documentation</em>. Apache Thrift. <a href="https://thrift.apache.org/docs/">https://thrift.apache.org/docs/</a></p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-5" href="#footnote-anchor-5" class="footnote-number" contenteditable="false" target="_self">5</a><div class="footnote-content"><p>Apple. <em>iCloud</em>. Apple. <a href="https://www.icloud.com/">https://www.icloud.com/</a></p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-6" href="#footnote-anchor-6" class="footnote-number" contenteditable="false" target="_self">6</a><div class="footnote-content"><p>Amazon. <em>Amazon S3</em>. Amazon. 
<a href="https://aws.amazon.com/pm/serv-s3/">https://aws.amazon.com/pm/serv-s3/</a></p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-7" href="#footnote-anchor-7" class="footnote-number" contenteditable="false" target="_self">7</a><div class="footnote-content"><p>MongoDB. <em>What is a Document Database?</em> MongoDB. <a href="https://www.mongodb.com/resources/basics/databases/document-databases">https://www.mongodb.com/resources/basics/databases/document-databases</a></p></div></div>]]></content:encoded></item></channel></rss>