Have a great idea for object recognition, a music generator, or a coding agent? Step one: get the data to train your model. I hate step one. I hate it so much, I go to great lengths to make sure I only need to get the data once. Unless you’re working with data you already have, expect pain.
In this post, I walk through an “easy case” of getting data: downloading security price data from Yahoo! Finance [1]. In this case, there’s a good library for the task at hand: yfinance [2]. yfinance makes API calls to Yahoo! to get CSV price data, and returns them as pandas DataFrames [3]. As you’ll see, even this easy case has several gotchas you need to be aware of.
The main method for price data is the download method:
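As a minimal sketch of what such a call can look like, assuming yfinance is installed and imported under the conventional alias yf: the wrapper name fetch_prices, the ticker, and the date range are placeholders of mine, not part of the library or of the original code.

```python
def fetch_prices(ticker: str, start: str, end: str):
    """Download daily price bars for one ticker as a pandas DataFrame."""
    # Imported lazily so that defining this helper needs no network access.
    import yfinance as yf

    # Dates are YYYY-MM-DD strings; start is inclusive and end is exclusive.
    return yf.download(ticker, start=start, end=end)


# Example call (it contacts Yahoo!'s servers, so it is left commented out):
# prices = fetch_prices("AAPL", "2025-01-01", "2025-02-01")
# print(prices.head())
```

The result is indexed by date, with one column per price field.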
Gotcha 1: Yahoo! broke yfinance
When I ran this method for the first time for this post, I ran into this issue:
I’ve seen a similar issue before with a previous version of the library. After some searching, I found a recent bug report that explained why it caught me this time. In a nutshell, Yahoo! stopped accepting requests with the user agent string that yfinance was providing. The error says I’ve been rate limited, but that’s a lie; Yahoo!’s servers simply rejected my requests. With this workaround:
I now get some reasonable-looking data:
Note: the yfinance maintainers will likely fix this bug in a future release, but I’ve seen similar issues working with this API in the past. It’s a bit of a game of Whac-A-Mole, since Yahoo! makes frequent changes, some of which cause the yfinance library to break. There are paid services for getting price data that are more reliable and don’t depend on Yahoo! Finance, like marketstack [4], Alpha Vantage [5], and Polygon.io [6]. That said, yfinance is free, and these services can be a bit pricey, especially if you don’t use them frequently.
Gotcha 2: Actual rate limiting
Allegedly, there actually is rate limiting associated with Yahoo! Finance. yfinance documents how to reduce the risk of being rate limited by using local caches [7]. Unfortunately, there’s no documentation about what rate limits are actually applied. Yahoo! does document that rate limiting exists in its legal documents [8], but it’s not clear if this is just to scare people away from abusing the service, or if there is indeed rate limiting.
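Whatever the real limits are, the safest request is the one you never repeat. The sketch below is not yfinance’s own caching mechanism; it is a hypothetical file-level cache (load_or_fetch and fake_download are names I made up) that calls the downloader only when no cached copy exists.

```python
import json
import tempfile
from pathlib import Path


def load_or_fetch(cache_path, fetch):
    """Return cached JSON data if present; otherwise call fetch() once and store it."""
    path = Path(cache_path)
    if path.exists():
        return json.loads(path.read_text())
    data = fetch()  # the only place a network request would happen
    path.write_text(json.dumps(data))
    return data


# Demonstration with a stand-in for the real download.
calls = []


def fake_download():
    calls.append(1)  # record each simulated request
    return {"ticker": "XYZ", "close": [20.0, 21.0]}


with tempfile.TemporaryDirectory() as tmp:
    cache_file = Path(tmp) / "xyz.json"
    first = load_or_fetch(cache_file, fake_download)   # downloads and caches
    second = load_or_fetch(cache_file, fake_download)  # served from the file
```

The second call never reaches the downloader, so repeated runs of an analysis cost no additional requests.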
Gotcha 3: Adjusted prices
The next issue is which data gets served. By default, all prices served by this library are adjusted to account for dividends and splits. For example, if a $21 stock pays a $1 dividend, and there was no other price movement, the stock price should go down to $20. The investor didn’t lose any money though, since they could just reinvest their dividend back in the stock. Adjusted prices account for this, so you don’t see big discontinuities in the stock price when dividends or splits happen. In this example, if the $20 price was today’s price after the $1 dividend was paid, prices before the dividend date would get multiplied by (20/21). These factors stack up to account for multiple dividends and for other corporate actions, like splits, that change the price. For many analyses, this is desirable, but it’s not desirable if you want to look at a signal like the dividend-to-price ratio, since the adjusted historic price may be different from the actual historic price.
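To make the arithmetic concrete, here is a small sketch of dividend back-adjustment in plain Python. The function name back_adjust and the input layout are my own choices for illustration, and splits are left out; the factor for each dividend is (previous close - dividend) / previous close, which gives the 20/21 from the example above.

```python
def back_adjust(closes, dividends):
    """Adjust closing prices for cash dividends.

    closes: raw closing prices, oldest first.
    dividends: {index: cash amount}, keyed by the index of the ex-dividend day.
    """
    adjusted = list(closes)
    factor = 1.0
    # Walk from the newest price to the oldest, accumulating factors as we go.
    for i in range(len(closes) - 1, -1, -1):
        adjusted[i] = closes[i] * factor
        if i in dividends and i > 0:
            previous_close = closes[i - 1]
            # Every price before the ex-dividend day gets scaled by this factor.
            factor *= (previous_close - dividends[i]) / previous_close
    return adjusted


# The example from the text: a $21 stock pays a $1 dividend and opens at $20.
single = back_adjust([21.0, 20.0], {1: 1.0})

# Two dividends stack: the oldest price is scaled by both factors.
stacked = back_adjust([10.0, 9.0, 9.0, 8.0], {1: 1.0, 3: 1.0})
```

In both cases the adjusted series comes out flat, which reflects that the investor’s total value did not change.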
On that note, the library can return dividends and splits, and there’s an argument for getting the price data unadjusted. I personally think it’s best to get all the data and decide later which columns I care about. There’s no way to get both the unadjusted and the adjusted data in a single call, but I found a workaround: download the unadjusted data, which already includes the adjusted close prices, and then use the adjusted close prices to adjust the open, high, and low prices. Putting it all together, I end up with:
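One way to sketch the adjustment step, assuming a frame with flat columns named Open, High, Low, Close, and Adj Close, as in unadjusted yfinance output (recent versions may return multi-level columns that need flattening first). The function name and the Adj Open, Adj High, and Adj Low column names are hypothetical, and the data is made up.

```python
import pandas as pd


def add_adjusted_ohl(raw: pd.DataFrame) -> pd.DataFrame:
    """Add adjusted open/high/low columns derived from the Adj Close / Close ratio."""
    out = raw.copy()
    # The same per-day factor that maps Close to Adj Close applies to the other prices.
    ratio = out["Adj Close"] / out["Close"]
    for column in ("Open", "High", "Low"):
        out["Adj " + column] = out[column] * ratio
    return out


# Hand-built sample: the first day sits before a dividend, so its factor is 20/21.
raw = pd.DataFrame(
    {
        "Open": [21.0, 20.0],
        "High": [42.0, 22.0],
        "Low": [10.5, 19.0],
        "Close": [21.0, 20.0],
        "Adj Close": [20.0, 20.0],
    },
    index=pd.to_datetime(["2025-05-01", "2025-05-02"]),
)
prices = add_adjusted_ohl(raw)
```

The raw columns stay untouched, so both the unadjusted and adjusted views are available from the same frame.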
Data transformation
Once you download the data, you need some format to store the data and work with it later. As I mentioned in my last post, I store the data using Avro. Avro is a serialization format, so that’s not very helpful for working with the data. DataFrames are easier to work with than the dicts returned from reading Avro files, so I have one step that converts the DataFrame I downloaded from yfinance into an Avro-writable dict:
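A rough sketch of the write-side conversion, under assumptions of mine: a column-oriented record with one list per field, snake_case field names, and a ticker field. The function name and the record layout are hypothetical, and writing the record with an Avro library is left out.

```python
import datetime

import pandas as pd


def frame_to_record(df: pd.DataFrame, ticker: str) -> dict:
    """Convert a date-indexed price frame into a plain dict of lists."""
    # Timestamps in the index become plain datetime.date objects.
    record = {"ticker": ticker, "date": [ts.date() for ts in df.index]}
    for column in df.columns:
        # "Adj Close" -> "adj_close"; tolist() yields native Python numbers.
        record[column.lower().replace(" ", "_")] = df[column].tolist()
    return record


sample = pd.DataFrame(
    {"Close": [20.0], "Adj Close": [19.5], "Volume": [1000]},
    index=pd.to_datetime(["2025-05-01"]),
)
record = frame_to_record(sample, "XYZ")
```

Keeping one list per field means the record mirrors the frame’s columns, which makes the reverse conversion straightforward.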
And I have another method that converts the data I cached in the Avro file into a DataFrame:
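The read side can be sketched in the same spirit, assuming a record that holds one list per field plus a scalar ticker, as a dict read back from a file might look. The layout and the function name are hypothetical.

```python
import datetime

import pandas as pd


def record_to_frame(record: dict) -> pd.DataFrame:
    """Build a DataFrame from the list-valued fields of a record."""
    # Scalar fields such as the ticker are skipped; each list becomes a column.
    columns = {key: value for key, value in record.items() if isinstance(value, list)}
    return pd.DataFrame(columns)


cached = {
    "ticker": "XYZ",
    "date": [datetime.date(2025, 5, 1), datetime.date(2025, 5, 2)],
    "close": [21.0, 20.0],
    "adj_close": [20.0, 20.0],
}
df = record_to_frame(cached)
```

Because every field name is a valid Python identifier, attribute access such as df.adj_close and df.date works on the result.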
Note that the final DataFrame is slightly easier to work with than the one I got from yfinance, because the column names are all proper variable names, so I can use accessors like df.adj_close to access the adjusted close prices. Also, I made the dates a column rather than an index, so df.date returns a column of Python date [9] objects. Finally, I picked the format of my Avro schema to make it easy to extract a DataFrame when reading the input.
My conclusions
As far as data downloads go, this was a fairly mild amount of pain. Still, even in this simple case, I hit several snafus, and it wasn’t the most fun programming exercise. I was lucky to avoid several other common data collection problems. For example, I didn’t need to do any web scraping, which is typically a ton of work to build, and a ton of work to maintain as the underlying website evolves. It’s also common for APIs to be poorly documented, so it’s not always clear what they return.
[1] Yahoo! Finance. Yahoo!, 2025, https://finance.yahoo.com/.
[2] Aroussi, Ran. yfinance documentation. 2025, https://yfinance-python.org/.
[3] The pandas development team. pandas.DataFrame. pandas Documentation, pandas, https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html. Accessed 6 May 2025.
[4] marketstack. marketstack, 2025, https://marketstack.com/.
[5] Alpha Vantage Inc. Alpha Vantage, 2025, https://www.alphavantage.co/.
[6] Polygon.io. Polygon.io, 2025, https://polygon.io/.
[7] Aroussi, Ran. Caching. yfinance, 2025, https://yfinance-python.org/advanced/caching.html.
[8] Yahoo. Yahoo Developer API Terms of Use. 2025, https://legal.yahoo.com/us/en/yahoo/terms/product-atos/apiforydn/index.html.
[9] Python Software Foundation. datetime — Basic Date and Time Types. Python 3.13.3 Documentation, https://docs.python.org/3/library/datetime.html.