<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Dataset on Svtter's Blog</title><link>https://svtter.cn/en/tags/dataset/</link><description>Recent content in Dataset on Svtter's Blog</description><generator>Hugo -- gohugo.io</generator><language>en</language><lastBuildDate>Mon, 24 Feb 2025 14:34:56 +0800</lastBuildDate><atom:link href="https://svtter.cn/en/tags/dataset/index.xml" rel="self" type="application/rss+xml"/><item><title>Where to Put Your Data Folder</title><link>https://svtter.cn/en/p/where-to-put-your-data-folder/</link><pubDate>Mon, 24 Feb 2025 14:34:56 +0800</pubDate><guid>https://svtter.cn/en/p/where-to-put-your-data-folder/</guid><description>&lt;p&gt;When training models, we should place data and code in the same location as much as possible.&lt;/p&gt;
&lt;p&gt;Keeping them in the same location helps avoid path-related issues, such as needing to specify absolute paths for the data.&lt;/p&gt;
&lt;p&gt;For example, if the code reads from the relative path &lt;code&gt;./data/&lt;/code&gt;, I only need to make the data available in the &lt;code&gt;./data&lt;/code&gt; directory next to the code.&lt;/p&gt;
&lt;p&gt;I can do this by creating a symbolic link:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;div class="chroma"&gt;
&lt;table class="lntable"&gt;&lt;tr&gt;&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code&gt;&lt;span class="lnt"&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-bash" data-lang="bash"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;ln -s &lt;span class="k"&gt;$(&lt;/span&gt;source-path-of-dataset&lt;span class="k"&gt;)&lt;/span&gt; ./data
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;This links data stored in another location into the project.&lt;/p&gt;
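&lt;p&gt;The same link can also be created from Python, for example in a setup script. This is a minimal sketch: the helper &lt;code&gt;link_data&lt;/code&gt; and the temporary directory layout are hypothetical and only serve as an illustration.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;import tempfile
from pathlib import Path


def link_data(source, project_dir):
    """Create project_dir/data as a symlink to source, if it is not there yet."""
    link = Path(project_dir) / "data"
    if not link.is_symlink() and not link.exists():
        link.symlink_to(Path(source).resolve(), target_is_directory=True)
    return link


# Hypothetical layout in a temporary directory, for illustration only.
root = Path(tempfile.mkdtemp())
dataset = root / "storage" / "dataset"
project = root / "project"
dataset.mkdir(parents=True)
project.mkdir()
(dataset / "sample.txt").write_text("example")

# Files in the dataset are now reachable through project/data.
data_dir = link_data(dataset, project)
print(data_dir.is_symlink(), (data_dir / "sample.txt").read_text())
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Calling the helper a second time leaves the existing link untouched, so it is safe to run on every start.&lt;/p&gt;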
&lt;p&gt;Git tracks a symbolic link as the link itself rather than the data it points to, so on the same host a checkout restores a working link automatically.&lt;/p&gt;
&lt;p&gt;On a different host, however, the dataset may live at another path, so you need to recreate or update the link yourself.&lt;/p&gt;</description></item><item><title>Browsing and Storing Image Datasets</title><link>https://svtter.cn/en/p/browsing-and-storing-image-datasets/</link><pubDate>Sun, 12 Jan 2025 18:31:12 +0800</pubDate><guid>https://svtter.cn/en/p/browsing-and-storing-image-datasets/</guid><description>&lt;p&gt;Browsing datasets can be quite troublesome, especially when the dataset is large.&lt;/p&gt;
&lt;p&gt;npy (NumPy array) and h5 (HDF5) files are two common formats for storing data.&lt;br&gt;
In my experience, the drawback of h5 files is that they are prone to data corruption: several times I have ended up with h5 files that could no longer be opened.&lt;br&gt;
npy files have clear advantages in terms of read speed and file transfer. However, by default they are loaded entirely into memory at once, which can easily exhaust memory if the server does not have enough RAM.&lt;/p&gt;
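&lt;p&gt;One way to soften the memory problem is to memory-map the npy file instead of loading it in full. The sketch below uses the &lt;code&gt;mmap_mode&lt;/code&gt; argument of &lt;code&gt;np.load&lt;/code&gt;. The file name and array shape are made up for illustration.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;import os
import tempfile

import numpy as np

# Hypothetical file name and array shape, chosen only for illustration.
path = os.path.join(tempfile.mkdtemp(), "images.npy")
np.save(path, np.zeros((1000, 64, 64), dtype=np.uint8))

# mmap_mode="r" maps the file read-only instead of reading it all into RAM.
images = np.load(path, mmap_mode="r")

# np.array copies only the selected slice from disk into memory.
batch = np.array(images[:32])
print(type(images).__name__, batch.shape)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;With this approach only the slices you actually index are read, which keeps memory use bounded by the batch size rather than the file size.&lt;/p&gt;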
&lt;p&gt;Common image datasets, such as COCO, typically store labels and images separately. This allows you to use a file browser to view the images and quickly get a sense of their characteristics. However, in most cases, we don&amp;rsquo;t view images on a local computer but rather work with datasets on a server.&lt;/p&gt;
&lt;p&gt;In 2024, when working with PyTorch, I find it more convenient to plot images directly with matplotlib. Matplotlib is generally used to display a single image at a time, but subplots let you display several images at once. With OpenCV, you can also draw label values onto the images. There is one drawback: if you are working on a remote server, transferring the generated images can consume significant bandwidth.&lt;br&gt;
Ultimately, the choice of method depends on your own judgment!&lt;/p&gt;</description></item></channel></rss>