WO2004036408A2 - A distributed network attached storage system - Google Patents

A distributed network attached storage system Download PDF

Info

Publication number
WO2004036408A2
WO2004036408A2 (PCT/US2003/033175)
Authority
WO
WIPO (PCT)
Prior art keywords
storage
physical
client
volumes
set forth
Prior art date
Application number
PCT/US2003/033175
Other languages
French (fr)
Other versions
WO2004036408A3 (en)
Inventor
Joshua L. Coates
Patrick E. Bozeman
Alfred Gary Landrum
Peter D. Mattis
Naveen Nalam
Drew Roselli
Original Assignee
Intel Corporation
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from U.S. Patent Application No. 10/367,541 (US7509645B2)
Application filed by Intel Corporation
Priority to AU2003301379A1
Publication of WO2004036408A2
Publication of WO2004036408A3

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0602Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
    • G06F3/061Improving I/O performance
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10File systems; File servers
    • G06F16/18File system types
    • G06F16/182Distributed file systems
    • G06F16/1824Distributed file systems implemented using Network-attached Storage [NAS] architecture
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0628Interfaces specially adapted for storage systems making use of a particular technique
    • G06F3/0629Configuration or reconfiguration of storage systems
    • G06F3/0635Configuration or reconfiguration of storage systems by changing the path, e.g. traffic rerouting, path reconfiguration
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0668Interfaces specially adapted for storage systems adopting a particular infrastructure
    • G06F3/067Distributed or networked storage systems, e.g. storage area networks [SAN], network attached storage [NAS]

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A distributed data storage system stores a single image file system across a plurality of physical storage volumes. The physical storage may be direct attached storage, or may be coupled through a storage area network ('SAN'). One or more clients communicate with a plurality of storage nodes through a network. A client of the distributed data storage system transmits a request over the network for a file identified in the file system. A load-balancing switch selects one of the storage nodes to process the request. The storage node accesses at least one of the physical volumes and transmits a response for the storage operation to the client.

Description

TITLE OF THE INVENTION A DISTRIBUTED NETWORK ATTACHED STORAGE SYSTEM
CROSS-REFERENCES TO RELATED APPLICATIONS
This application claims the benefit of: U.S. Provisional Patent Application No. 60/419,778, filed October 17, 2002, entitled "A Distributed Storage System"; U.S. Patent Application No. 10/368,026, filed February 13, 2003, entitled "A Distributed Network Attached Storage System"; U.S. Patent Application No. 10/367,541, filed February 13, 2003, entitled "Methods and Apparatus for Load Balancing Storage Nodes In A Distributed Network Attached Storage System"; and U.S. Patent Application No. 10/367,436, filed February 13, 2003, entitled "Methods and Apparatus For Load Balancing Storage Nodes In A Distributed Storage Area Network System."
BACKGROUND OF THE INVENTION
Field of the Invention:
The present invention is directed toward the field of data storage, and more particularly toward a distributed network data storage system.
Art Background:
There is an increasing demand for systems that store large amounts of data. Many companies struggle to provide scalable, cost-effective storage solutions for large amounts of data stored in files (e.g., terabytes of data). One type of prior art system used to store data for computers is known as network attached storage ("NAS"). In a NAS configuration, a computer, such as a server, is coupled to physical storage, such as one or more hard disk drives. The NAS server is accessible over a network. In order to access the storage, the client computer submits requests to the server to store and retrieve data.
Conventional NAS technology has several inherent limitations. First, NAS systems are severely impacted by their fundamental inability to scale performance and capacity. Current NAS systems only scale performance within the limits of a single NAS server with a single network connection. Thus, a single NAS server can only scale capacity to a finite number of disks attached to that NAS server. These fundamental limitations of current file storage systems create a variety of challenges. First, customers must use multiple NAS systems to meet capacity and performance requirements. The use of multiple NAS systems requires the customer to manage multiple file systems and multiple NAS system images.
These workarounds lead to inefficient utilization of storage assets because files must be manually distributed across multiple NAS systems to meet overall capacity and performance requirements. Invariably, this leaves pockets of unused capacity in the multiple NAS systems. Moreover, frequently accessed files, sometimes referred to as hot files, may only be served by a single NAS server, resulting in a bottleneck that impacts performance of the storage system. These issues result in substantially higher management costs to the end-user as well as high acquisition costs to purchase proprietary NAS systems.
A storage area network ("SAN") is another configuration used to store large amounts of data. In general, a SAN configuration consists of a network of disks. Clients access disks over a network. Using the SAN configuration, the client typically accesses each individual disk as a separate entity. For example, a client may store a first set of files on a first disk in a network, and store a second set of files on a second disk in the SAN system. Thus, this technique requires the clients to manage file storage across the disks on the storage area network. Accordingly, the SAN configuration is less desirable because it requires the client to specifically manage storage on each individual disk. Accordingly, it is desirable to develop a system that manages files with a single file system across multiple disks.
SUMMARY OF THE INVENTION
A distributed data storage system stores a single image file system across a plurality of physical storage volumes. One or more clients communicate with the distributed data storage system through a network. The distributed data storage system includes a plurality of storage nodes. Each storage node services requests for storage operations on the files stored on the physical storage volumes. In one embodiment, the physical storage is direct attached storage. For this embodiment, at least one physical storage volume is directly coupled to each storage node. In another embodiment, the physical storage volumes are coupled to the storage nodes through a storage area network ("SAN").
To conduct a storage operation, including read and write operations, a client transmits a request over the network for a file identified in the file system. One of the storage nodes is selected to process the request. In one embodiment, the distributed data storage system contains a load balancing switch that receives the request from the client and that selects one of the storage nodes to process the storage operation. To process the request, the storage node accesses at least one of the physical volumes and transmits a response for the storage operation to the client.
BRIEF DESCRIPTION OF THE DRAWINGS
Figure 1 is a block diagram illustrating one embodiment for the distributed network attached storage system of the present invention.
Figure 2 is a block diagram illustrating one embodiment for assigning client requests in the distributed NAS system.
Figure 3 is a block diagram illustrating one embodiment for a distributed NAS system incorporating direct attached disks.
Figure 4 is a block diagram illustrating one embodiment for using a SAN configuration for a distributed NAS system.
Figure 5 is a flow diagram illustrating one embodiment for initializing a client computer in the distributed NAS system.
Figure 6 is a flow diagram illustrating one embodiment for conducting a read operation in the distributed NAS system.
Figure 7 is a flow diagram illustrating one embodiment for processing read operations in the volume manager.
Figure 8 is a flow diagram illustrating one embodiment for conducting a write operation in the distributed NAS system.
Figure 9 is a block diagram illustrating one embodiment for performing a write operation in the volume manager.
DETAILED DESCRIPTION
The disclosures of U.S. Provisional Patent Application No. 60/419,778, filed October 17, 2002, entitled "A Distributed Storage System", U.S. Patent Application No. 10/368,026, filed February 13, 2003, entitled "A Distributed Network Attached Storage System", U.S. Patent Application No. 10/367,541, filed February 13, 2003, entitled "Methods and
Apparatus for Load Balancing Storage Nodes In A Distributed Network Attached Storage System", and U.S. Patent Application No. 10/367,436, filed February 13, 2003, entitled "Methods and Apparatus For Load Balancing Storage Nodes In A Distributed Storage Area Network System" are hereby expressly incorporated herein by reference.
Figure 1 is a block diagram illustrating one embodiment for the distributed network attached storage system of the present invention. As shown in Figure 1, the system 100 includes "n" nodes (wherein n is any integer greater than or equal to two). Each node may be implemented with a conventional computer, such as a server. Also, as shown in Figure 1, the nodes are coupled to each other in order to provide a single image across each node of the system. In one embodiment, the nodes are coupled together through an Ethernet network.
The nodes (1-n) are coupled to a network (150). Also coupled to the network are "m" clients, where "m" is an integer value greater than or equal to one. The network may be any type of network that utilizes any well-known protocol (e.g., TCP/IP, UDP, etc.). Also, as shown in Figure 1, the distributed NAS system 100 includes physical storage 110 accessible by the nodes. For example, physical storage 110 may comprise one or more hard disk drives configured to support storage failure modes (e.g., a RAID configuration). A client, such as clients 115, 120 and 130, accesses a node across network 150 to store and retrieve data on physical storage 110.
In general, the distributed NAS system of the present invention creates a single system image that scales in a modular way to hundreds of terabytes and several hundred thousand operations per second. In one embodiment, to minimize costs, the distributed NAS system software runs on industry standard hardware and operates with industry standard operating systems. The distributed NAS system allows flexible configurations based on specific reliability, capacity, and performance requirements. In addition, the distributed NAS system scales without requiring any changes to end user behavior, client software or hardware. For optimal performance, in one embodiment, the distributed NAS system distributes client load evenly so as to eliminate a central control point vulnerable to failure or performance bottlenecks. The distributed NAS system permits storage capacity and performance to scale without disturbing the operation of the system. To achieve these goals, the distributed NAS system utilizes a distributed file system as well as a volume manager. In one embodiment, each node (or server) consists of, in addition to standard hardware and operating system software, a distributed file system manager (165, 175 and 185) and a volume manager (160, 170 and 180) for nodes 1, 2 and n, respectively.
Figure 2 is a block diagram illustrating one embodiment for assigning client requests in the distributed NAS system. For this embodiment, clients (1-n) are coupled to a load balance switch 250, accessible over a network. In one embodiment, load balance switch 250 comprises a layer four (L4) load-balancing switch. In general, L4 switches are capable of effectively prioritizing TCP and UDP traffic. In addition, L4 switches, incorporating load-balancing capabilities, distribute requests for HTTP sessions among a number of resources, such as servers. In operation, clients, executing storage operations, access load balance switch 250, and load balance switch 250 selects a node (server) to service the client storage operation.
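The text does not specify which load-balancing criteria the L4 switch applies; the sketch below assumes a simple least-connections policy over the NAS nodes, purely for illustration.

```python
# Minimal sketch of L4-style node selection, assuming a least-connections
# policy; the actual switch criteria are not specified in the text.
from dataclasses import dataclass

@dataclass
class NasNode:
    name: str
    active_connections: int = 0

class LoadBalancer:
    def __init__(self, nodes):
        self.nodes = list(nodes)

    def select_node(self) -> NasNode:
        # Pick the node currently servicing the fewest client connections.
        node = min(self.nodes, key=lambda n: n.active_connections)
        node.active_connections += 1
        return node

# Example: three NAS nodes behind the switch.
balancer = LoadBalancer([NasNode("node A"), NasNode("node B"), NasNode("node C")])
print(balancer.select_node().name)   # first request goes to node A
```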
The nodes of the distributed NAS system communicate with one or more hard disk drives. Figure 3 is a block diagram illustrating one embodiment for a distributed NAS system incorporating direct attached disks. As shown in Figure 3, each node (node 1, node 2, ..., node n) is coupled to "n" disks (310, 320 and 330). For this embodiment, a node directly accesses one or more disks through a standard hard disk drive interface (e.g., EIDE, SCSI, iSCSI, or fiber channel). Figure 3 illustrates "n" disks attached to a node (server); however, any number of disks, including a single disk, may be attached to a node without deviating from the spirit or scope of the invention.
In another embodiment, the nodes of the distributed NAS system utilize disks coupled through a network (e.g., storage area network "SAN"). Figure 4 is a block diagram illustrating one embodiment for using a SAN configuration for a distributed NAS system. As shown in Figure 4, the distributed NAS nodes (servers) are coupled to a storage area network 410. The storage area network 410 couples a plurality of hard disk drives to each node (server) in the distributed NAS system. The storage area network 410 may comprise any type of network, such as Ethernet, Fiber Channel, etc. In operation, a node accesses a disk, as necessary, to conduct read and write operations. Each node (server) has access to each disk in the storage area network 410. For example, if volume manager 170 determines that data resides on disk 420, then volume manager 170 accesses disk 420 over storage area network 410 in accordance with the protocol for storage area network 410. If storage area network 410 implements a TCP/IP protocol, then volume manager 170 generates packet requests to disk 420 using the IP address assigned to disk 420.
In general, index nodes, referred to as "inodes," uniquely identify files and directories. Inodes map files and directories of a file system to physical locations. Each inode is identified by a number. For a directory, an inode includes a list of file names and subdirectories, if any, as well as a list of data blocks that constitute the file or subdirectory. The inode also contains the size, position, etc. of the file or directory. When a selected node (NAS server) receives a request from the client to service a particular inode, the selected node performs a lookup to obtain the physical location of the corresponding file or directory on the physical media.
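As a rough illustration only, an inode of the kind described above might be modeled as follows; the field names are assumptions, not the patent's on-disk layout.

```python
# Illustrative in-memory model of an inode; field names are hypothetical.
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Inode:
    number: int                      # unique inode number
    size: int = 0                    # size of the file or directory, in bytes
    data_blocks: List[int] = field(default_factory=list)   # physical block numbers
    entries: Dict[str, int] = field(default_factory=dict)  # name -> inode number (directories only)

    def lookup(self, name: str) -> int:
        """Return the inode number of a child file or subdirectory."""
        return self.entries[name]

# A directory inode listing one file, matching the read example later in the text:
temp_dir = Inode(number=55, entries={"foo.txt": 136})
assert temp_dir.lookup("foo.txt") == 136
```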
As an initial procedure, a client of the distributed NAS system mounts the distributed file system. Figure 5 is a flow diagram illustrating one embodiment for initializing a client computer in the distributed NAS system. Through the client distributed NAS software, the client generates a request to a selected node to mount the NAS file system (block 520, Figure 5). As used herein, the term "selected node" connotes the node servicing the client request. As described above, in one embodiment, the node is selected by a load balance switch (i.e., the client generates a network request to the load balance switch, and the load balance switch selects, based on load-balancing criteria, a server to service the request).
The selected node (file system manager) obtains the inode for the file system root directory, and generates a client file handle to the root directory (block 530, Figure 5). The selected node determines the inode of the root directory using a "superblock." The superblock is located at a known address on each disk. Each disk uses a superblock to point to a location on one of the disks that stores the inode for the root directory of the file system. Once the root inode is located, the file system manager finds a list of files and directories contained within the root directory.
The file handle, a client-side term, is a unique identifier the client uses to access a file or directory in the distributed file system. In one embodiment, the distributed file system translates the file handle into an inode. In addition, a file handle may include time and date information for the file/directory. However, any type of file handle may be used as long as the file handle uniquely identifies the file or directory.
The selected node (the node processing the client requests) generates a mount table (block 540, Figure 5). In general, the mount table tracks information about the client (e.g., client address, mounted file systems, etc.). The mount table, a data structure, is replicated in each node of the distributed NAS system, and is globally and atomically updated (block 550, Figure 5). The selected node transmits to the client a file handle to the root directory (block 560, Figure 5). The client caches the file handle for the root directory (block 570, Figure 5).
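A minimal sketch of the mount sequence of Figure 5 follows; the class and structures are hypothetical stand-ins for the node's file system manager state, shown only to trace the superblock-to-root-inode lookup, file-handle generation, and mount-table update described above.

```python
# Sketch of the client-mount flow of Figure 5; names and structures are
# hypothetical, not the patent's actual interfaces.
import time

class SelectedNode:
    def __init__(self, root_inode_number: int):
        self.root_inode_number = root_inode_number   # located via the superblock
        self.mount_table = []                        # replicated on every node

    def mount(self, client_address: str):
        # Block 530: the superblock points at the root directory's inode;
        # here the result of that lookup is modeled as a stored inode number.
        root_inode = self.root_inode_number

        # Generate a file handle for the root directory.  A handle that pairs
        # the inode number with time/date information is one form the text allows.
        root_handle = (root_inode, time.time())

        # Blocks 540/550: record the client in the mount table; in the real
        # system this update is replicated globally and atomically.
        self.mount_table.append({"client": client_address, "mounted": "/"})

        # Blocks 560/570: the handle is returned for the client to cache.
        return root_handle

node = SelectedNode(root_inode_number=2)
print(node.mount("10.0.0.15"))
```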
In one embodiment, the file system for the distributed NAS is a high-performance distributed file system. The file system fully distributes both namespace and data across a set of nodes and exports a single system image for clients, applications and administrators. As a multi-node system, the file system acts as a highly scalable, high-performance file server with no single point of failure. As a storage medium, the file system utilizes a single shared disk array. It harnesses the power of multiple disk arrays connected either via a storage area network or directly to network servers. The file system is implemented entirely in user space, resulting in a lightweight and portable file system. In one embodiment, the file system provides 64-bit support to allow very large file system sizes.
The volume manager (160, 170 and 180, Figure 1) controls and virtualizes logical storage volumes, either directly attached to a node (through EIDE, SCSI, iSCSI, or fiber channel) or indirectly attached through another server on the LAN. The volume manager offers administrators access to advanced management features. It provides the ability to extend logical volumes across nodes. This results in flexible, reliable, high-performance storage management in a multi-node network environment.
The volume manager consists of three parts: logical volumes, volume groups, and physical volumes. Each layer has particular properties that contribute to the capabilities of the system. The distributed volume group is the core component of the system. A volume group is a virtualized collection of physical volumes. In its simplest form, a distributed volume group may be analogized to a special data container with reliability properties. A volume group has an associated level of reliability (e.g., RAID level). For example, a distributed volume group may have similar reliability characteristics to traditional RAID 0,1 or 5 disk arrays. Distributed volume groups are made up of any number, type or size of physical volumes.
A logical volume is a logical partition of a volume group. The file systems are placed in distributed logical volumes. A logical extent is a logically contiguous piece of storage within a logical volume. A physical volume is any block device, either hardware or software, exposed to the operating system. A physical extent is a contiguous piece of storage within a physical storage device. A sector, typically 512 bytes, defines the smallest unit of physical storage on a storage device.
A physical volume is a resource that appears to the operating system as a block based storage device (e.g., a RAID device, the disk through fiber channel, or a software RAID device). A volume, either logical or physical, consists of units of space referred to as "extents." Extents are the smallest units of contiguous storage exposed to the distributed volume manager.
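The three-layer model described above can be pictured with the following sketch; the class shapes are assumptions for illustration, not the volume manager's actual interfaces.

```python
# Hypothetical sketch of the volume manager's three layers:
# physical volumes -> distributed volume group -> logical volumes.
from dataclasses import dataclass, field
from typing import List

@dataclass
class PhysicalVolume:
    device: str            # any block device exposed to the operating system
    extents: int           # number of physical extents on the device

@dataclass
class VolumeGroup:
    name: str
    raid_level: int                      # associated reliability level (0, 1 or 5)
    physical_volumes: List[PhysicalVolume] = field(default_factory=list)

    def total_extents(self) -> int:
        return sum(pv.extents for pv in self.physical_volumes)

@dataclass
class LogicalVolume:
    name: str
    group: VolumeGroup
    extents: int                         # logical partition of the volume group

# A distributed volume group built from two physical volumes, carved into
# one logical volume that holds a file system.
vg = VolumeGroup("vg0", raid_level=5,
                 physical_volumes=[PhysicalVolume("/dev/sda", 1024),
                                   PhysicalVolume("/dev/sdb", 1024)])
lv = LogicalVolume("lv_export", vg, extents=1500)
```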
The volume manager allows unprecedented flexibility and scalability in storage management, enhancing the reliability of large-scale storage systems. In one embodiment, the distributed volume manager implements standard RAID 0, 1 and 5 configurations on distributed volume groups. When created, each distributed volume group is given reliability settings that include stripe size and RAID-set size. Stripe size, sometimes referred to as a chunk or block, is the smallest granularity of data written to an individual physical volume. Stripe sizes of 8k, 16k and 24k are common. RAID-set size refers to the number of stripes between parity calculations. This is typically equal to the number of physical volumes in a volume group.
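To make the striping parameters concrete, the sketch below maps a byte offset within a volume group to a stripe, a physical volume, and an offset inside that volume, assuming a plain RAID-0 layout; RAID-1 and RAID-5 would add mirroring or rotating parity on top of the same arithmetic.

```python
# Stripe arithmetic for a RAID-0 style layout (illustrative assumption only).
def locate(byte_offset: int, stripe_size: int, num_physical_volumes: int):
    """Map a logical byte offset to (physical volume index, byte offset on that volume)."""
    stripe_index = byte_offset // stripe_size          # which stripe the byte falls in
    volume_index = stripe_index % num_physical_volumes # stripes rotate across volumes
    stripe_row = stripe_index // num_physical_volumes  # full rows of stripes before it
    offset_in_volume = stripe_row * stripe_size + (byte_offset % stripe_size)
    return volume_index, offset_in_volume

# With a 16k stripe size across 4 physical volumes, byte 40,000 of the
# logical volume falls in stripe 2, i.e. on physical volume 2.
print(locate(40_000, 16 * 1024, 4))    # -> (2, 7232)
```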
As discussed above, inodes consist of pointers to physical blocks that store the underlying data. In one embodiment, inodes are stored on disk in "ifiles." For directories, inode files contain a list of inodes for all files and directories contained in that directory. In one embodiment, the distributed NAS system utilizes a map manager. In general, a map manager stores information to provide an association between inodes and distributed NAS nodes (servers) managing the file or directory. The map manager, a data structure, is globally stored (i.e., stored on each node) and is atomically updated. Table 1 is an example map manager used in the distributed NAS system.
TABLE 1
Inode range        Managing node
0 - 100            node A
101 - 200          node B
201 - 300          node C
For this example, the distributed NAS system contains three nodes (A, B and C). Inodes within the range from 0 to 100 are managed by node A. Inodes lying within the range of 101 to 200 are managed by node B, and inodes falling within the range of 201 to 300 are managed by node C.
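A minimal sketch of a map-manager lookup over inode ranges like those in Table 1; the structure is an assumption about how the globally replicated table might be held in memory.

```python
# Hypothetical in-memory form of the map manager of Table 1.
INODE_MAP = [
    (range(0, 101),   "node A"),
    (range(101, 201), "node B"),
    (range(201, 301), "node C"),
]

def storage_node_for(inode: int) -> str:
    """Return the distributed NAS node that manages the given inode."""
    for inode_range, node in INODE_MAP:
        if inode in inode_range:
            return node
    raise KeyError(f"inode {inode} is not managed by any node")

assert storage_node_for(55) == "node A"    # "/export/temp" in the read example below
assert storage_node_for(136) == "node B"   # "foo.txt" in the read example below
```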
Figure 6 is a flow diagram illustrating one embodiment for conducting a read operation in the distributed NAS system. To conduct a read operation, a client generates a read request to the distributed NAS system with a directory/file name (block 610, Figure 6). The distributed NAS system selects a node to process the request (i.e., the selected node). For example, the load-balancing switch may select node C to process the read operation. Also, for this example, a client may generate a request to read the file "/export/temp/foo.txt." For this example, the client must obtain a file handle for "/export/temp/foo.txt." To accomplish this, the client starts with the root file handle (i.e., the root file handle was obtained when the client mounted the distributed file system).
If the client has cached the file handle for "/export", then the client first requests a file handle for "/export/temp." In response to the client request, the selected node (server) determines the inode for the directory/file (block 620, Figure 6). For the above example, the selected node determines the inode for the directory "/export/temp." Specifically, the selected node looks up, in the list of inodes for the "/export" directory, the inode for the directory "/temp." For purposes of explanation, the associated inode for the directory "/temp" is 55.
With the inode, the selected node determines, from the map manager, the storage node for the directory/file (block 630, Figure 6). For the above example and the map manager shown in Table 1, inode 55 is managed by node A. The selected node queries the storage node (the node managing the directory/file) for a lock on the directory/file (block 640, Figure 6). In the example set forth above, node C, the selected node, queries node A, the storage node, to obtain a lock for the directory "/export/temp." A lock may be an exclusive or shared lock, including both read and write types. If a lock is available for the file/directory, then the storage node assigns a read lock for the directory/file to the selected node (blocks 645 and 660, Figure 6). If a lock is not available, then the storage node attempts to revoke the existing lock(s) (blocks 645 and 650, Figure 6). If the storage node can revoke the existing lock(s), then the storage node assigns a read lock to the selected node for the directory/file (blocks 650 and 660, Figure 6). If the storage node cannot revoke existing lock(s), then an error message is transmitted to the client that the file/directory is not currently available for reading (blocks 650 and 655, Figure 6).
After obtaining the appropriate lock, the selected node transmits a file handle to the client (block 665, Figure 6). For the above example, the selected node, node C, transmits a file handle for the directory "/export/temp." The client caches the file handle. If additional directory/file handles are required to read the file, the process to obtain additional directory/file handles is performed (block 670, Figure 6). For the above example, the client generates a read request for "/export/temp/foo.txt." Thereafter, the selected node determines the inode for the file "/export/temp/foo.txt." For this example, the file system manager looks up inode 55, and identifies the file, foo.txt, as being located in the "/temp" directory. The file system manager extracts the inode associated with the file, foo.txt (e.g., inode = 136). The map manager identifies node B as the owner of inode 136. Thus, node C, the selected node, communicates with node B, the storage node, to obtain a lock for the file, foo.txt. Node C then returns the file handle of foo.txt to the client.
In response to the read request, the file system manager obtains the necessary blocks, from the volume manager, to read the file (block 675, Figure 6). The file system manager, using inode 136, looks up the file inode and identifies the physical blocks associated with inode 136. For the above example, if the client requested to read the first 1024 bytes of the file, then the file system manager issues the command (read blocks 130 and 131, buffer) to read the first two blocks of the file (e.g., the first two blocks of the file "/export/temp/foo.txt" are numbered 130 and 131). In response, the volume manager places the first 1024 bytes of the file "/export/temp/foo.txt" in a buffer. The selected node returns the data from the buffer to the client (block 680, Figure 6).
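The read path of Figure 6 can be condensed into the following sketch; the helper names (lookup_child, acquire_read_lock, read_blocks, and so on) are hypothetical, and error handling for unrevocable locks is omitted for brevity.

```python
# Condensed sketch of the Figure 6 read path on the selected node.
# Helper names are hypothetical placeholders for the managers described above.
def read_file(selected_node, map_manager, path: str, length: int):
    # Resolve the path one component at a time, starting from the root inode,
    # obtaining a read lock from each component's storage node (blocks 620-660).
    inode = selected_node.root_inode()
    for component in path.strip("/").split("/"):
        inode = selected_node.lookup_child(inode, component)       # e.g. "/export/temp" -> 55
        storage_node = map_manager.storage_node_for(inode)         # e.g. inode 55 -> node A
        storage_node.acquire_read_lock(inode, requester=selected_node)

    # Block 675: the file system manager asks the volume manager for the
    # physical blocks backing the file inode, then returns them to the client.
    blocks = selected_node.file_system_manager.blocks_for(inode, 0, length)
    buffer = selected_node.volume_manager.read_blocks(blocks)      # e.g. blocks 130 and 131
    return buffer                                                  # block 680: data to client
```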
In general, the volume manager responds to requests from the distributed file system manager. Figure 7 is a flow diagram illustrating one embodiment for processing read operations in the volume manager. To initiate the process, the volume manager receives the request from the file system manager (block 710, Figure 7). A volume is spread across nodes. Each disk (e.g., 0-256 sectors) requires a mapping to translate virtual sectors to physical sectors. The volume manager determines the physical volume node for the subject of the read operation (block 720, Figure 7). The volume manager communicates with the physical volumes. To conduct a read operation, the file system manager requests the volume manager to read/write a block or a group of sectors (e.g., sectors 24-64, etc.).
The volume manager determines the disk and disk offset (block 730, Figure 7). The volume manager algebraically determines the location of the logical sectors on the physical volumes. Table 2 illustrates an example mapping from disks to nodes for an example distributed NAS system.
TABLE 2
[Table 2 not reproduced: example mapping of logical sectors to nodes, disks and disk offsets.]
For this embodiment, the volume manager calculates the node in accordance with the arrangement illustrated in Table 2. The disks are apportioned by sectors, and the offset measures the number of sectors within a disk. The volume manager obtains blocks of data from the node, disk on the node and the offset within the disk (block 740, Figure 7). The volume manager then returns data to the buffer (file system manager) (block 750, Figure 7).
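As an illustration of that algebraic mapping, the sketch below translates a virtual sector number into a node, a disk on that node, and a sector offset within the disk, assuming (as in the example above) that every disk exposes 256 sectors and that disks are ordered node by node; the actual Table 2 layout is not reproduced in the text.

```python
# Hypothetical algebraic mapping of virtual sectors to (node, disk, offset),
# assuming 256 sectors per disk and disks laid out node by node.
SECTORS_PER_DISK = 256

def locate_sector(virtual_sector: int, disks_per_node: int):
    disk_index = virtual_sector // SECTORS_PER_DISK        # which disk overall
    offset = virtual_sector % SECTORS_PER_DISK             # sector offset within that disk
    node = disk_index // disks_per_node                    # which node holds the disk
    disk_on_node = disk_index % disks_per_node             # which disk on that node
    return node, disk_on_node, offset

# Sector 600 with two disks per node: third disk overall, i.e. node 1, disk 0, offset 88.
print(locate_sector(600, disks_per_node=2))   # -> (1, 0, 88)
```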
Figure 8 is a flow diagram illustrating one embodiment for conducting a write operation in the distributed NAS system. First, a client generates a write request to the distributed NAS system with a directory/file name (block 810, Figure 8). The distributed NAS system selects a node to process the request (e.g., node C). Using the above example, a client may generate a request to write to the file "/export/temp/foo.txt." For this example, the client must obtain a file handle for "/export/temp/foo.txt." As described above, the client starts with the root file handle. If the client has cached the file handle for "/export", then the client first requests a file handle for "/export/temp." In response to the client request, the selected node (server) determines the inode for the directory/file (block 820, Figure 8). For the above example, the selected node determines the inode for the directory "/export/temp," inode 55. The selected node determines, from the map manager, the storage node for the directory/file from the associated inode (block 830, Figure 8). For the above example (Table 1), inode 55 is managed by node A. The selected node queries the storage node (the node managing the directory/file) for a lock on the directory/file (block 840, Figure 8). Thus, node C, the selected node, queries node A, the storage node, to obtain a write lock for the directory "/export/temp." If a write lock is available for the file/directory, then the storage node assigns the write lock for the directory/file to the selected node (blocks 845 and 860, Figure 8). If a lock is not available, then the storage node attempts to revoke the existing lock(s) (blocks 845 and 850, Figure 8). If the storage node can revoke the existing lock(s), then the storage node assigns the write lock to the selected node for the directory/file (blocks 850 and 860, Figure 8). If the storage node cannot revoke existing lock(s), then an error message is transmitted to the client that the file/directory is not currently available for writing (blocks 850 and 855, Figure 8).
After obtaining the appropriate lock, the selected node transmits a file handle to the client (block 865, Figure 8). For the above example, the selected node, nodeC, transmits a file handle for the directory "/export/temp." The client caches the file handle. If additional directory/file handles are required to access the file, the process to obtain them is repeated (block 870, Figure 8). For the above example, the client generates a request for a file handle for "/export/temp/foo.txt." As discussed above, the selected node determines the inode for the file "/export/temp/foo.txt" (e.g., inode 136). The map manager identifies the storage node that owns inode 136. A lock for the file, foo.txt, is obtained, and nodeC returns the file handle of foo.txt to the client. The client then transmits the data for the write operation along with the file handle (block 875, Figure 8). The file system manager and the volume manager execute the write operation (see Figure 9). The client receives a write confirmation from the file system manager (block 880, Figure 8).
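The client-side handle resolution described in this example (walking the path component by component and caching each returned handle) might look like the following sketch; the resolve and lookup names are hypothetical and stand in for the NFS-style requests exchanged with the selected node.

```python
# Illustrative sketch (not from the patent): a client resolving a path one
# component at a time, reusing cached file handles. The lookup() argument is a
# stand-in for the NFS-style request sent to the selected node.

def resolve(path, handle_cache, root_handle, lookup):
    """Return the file handle for `path`, caching each intermediate handle."""
    handle = handle_cache.setdefault("/", root_handle)
    current = ""
    for component in (p for p in path.split("/") if p):
        current += "/" + component
        if current in handle_cache:
            handle = handle_cache[current]       # e.g., "/export" already cached
        else:
            handle = lookup(handle, component)   # ask the selected node for a handle
            handle_cache[current] = handle
    return handle

# Example with a toy lookup that derives child handles from parent handles.
cache = {}
fh = resolve("/export/temp/foo.txt", cache, "fh-root",
             lambda parent, name: f"{parent}/{name}")
```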
Figure 9 is a block diagram illustrating one embodiment for performing a write operation in the volume manager. The file system manager on the selected node receives data from the client for the write operation (block 910, Figure 9). In response, the file system manager requests blocks of data from the volume manager (block 915, Figure 9). The volume manager determines the physical volume node for the write operation (block 920, Figure 9). The volume manager determines the disk and the offset within the disk (block 930, Figure 9). The volume manager then obtains blocks of data from the node, disk and offset (block 940, Figure 9). The volume manager returns the read data to a buffer (block 950, Figure 9). The file system manager writes the data for the write operation to the buffer (block 960, Figure 9). Thereafter, the volume manager writes the data from the buffer to the physical disk (block 970, Figure 9).
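The buffered read-modify-write sequence of Figure 9 can be illustrated with the short sketch below, which assumes 512-byte sectors and a volume object exposing read and write operations built on the sector mapping sketched earlier; these are assumptions for the example, not the patent's interfaces.

```python
# Illustrative read-modify-write sketch (blocks 910-970, Figure 9), assuming
# 512-byte sectors and a `volume` object exposing read()/write() built on the
# sector mapping sketched above. Both are assumptions, not the patent's API.

SECTOR_SIZE = 512

def write_bytes(logical_sector: int, payload: bytes, volume) -> None:
    """Write `payload` starting at `logical_sector` via a buffered read-modify-write."""
    n_sectors = (len(payload) + SECTOR_SIZE - 1) // SECTOR_SIZE
    # Blocks 920-950: the volume manager reads the affected sectors into a buffer.
    buffer = bytearray(volume.read(logical_sector, logical_sector + n_sectors - 1))
    # Block 960: the file system manager writes the new data into the buffer.
    buffer[:len(payload)] = payload
    # Block 970: the volume manager writes the buffer back to the physical disk.
    volume.write(logical_sector, bytes(buffer))
```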
Although the present invention has been described in terms of specific exemplary embodiments, it will be appreciated that various modifications and alterations might be made by those skilled in the art without departing from the spirit and scope of the invention.

Claims

What is claimed is:
1. A distributed data storage system comprising: a plurality of physical storage volumes for storing a plurality of files of data accessible by a single file system; a network; a plurality of storage nodes, coupled to said network, each one of said storage nodes having access to each of said files stored on said physical storage volumes, and wherein each of said storage nodes for receiving a request from a client for a storage operation on a file identified in said file system, for processing said storage operation by accessing at least one of said physical volumes, and for transmitting a response for said storage operation to said client.
2. The distributed storage system as set forth in claim 1, further comprising a load balancing switch, coupled to said network, for receiving said request from a client for a storage operation and for selecting one of said client nodes to process said storage operation.
3. The distributed storage system as set forth in claim 1, wherein at least one of said physical storage volumes is directly coupled to each of said storage nodes.
4. The distributed storage system as set forth in claim 1, wherein said physical storage volumes are coupled through a network accessible by said storage nodes.
5. The distributed storage system as set forth in claim 1, wherein a storage node comprises: a file system manager for processing said client requests for storage operations; and a volume manager for accessing said physical volumes.
6. The distributed storage system as set forth in claim 5, wherein said file system manager of a first storage node for communicating to a volume manager of a second storage node to access a file stored on a physical volume attached to said second storage node.
7. The distributed storage system as set forth in claim 1, wherein said storage operation comprises a write operation.
8. The distributed storage system as set forth in claim 1, wherein said storage operation comprises a read operation.
9. A method for storing files in a distributed storage system, said method comprising the steps of: storing a plurality of files, accessible by a single file system, in a plurality of physical storage volumes; coupling a plurality of storage nodes to said physical storage volumes through said network; providing access to each one of said storage nodes to each of said physical storage volumes; receiving, at each of said storage nodes, a request from a client for a storage operation on a file identified in said file system; accessing at least one of said physical volumes in response to said storage operation; and transmitting a response to said storage operation to said client.
10. The method as set forth in claim 9, further comprising the steps of: coupling a load balancing switch to said network; receiving said request from a client for a storage operation; and selecting one of said client nodes to process said storage operation.
11. The method as set forth in claim 9, further comprising the step of coupling at least one of said physical storage volumes to each of said storage nodes.
12. The method as set forth in claim 9, further comprising the step of accessing said physical storage volumes through a network.
13. The method as set forth in claim 9, wherein: the step of receiving, at each of said storage nodes, a request from a client for a storage operation comprises the steps of: receiving a request at a file system manager; and processing said client request for said storage operation; the step of accessing at least one of said physical volumes in response to said storage operation comprises the step of accessing said physical volumes from a volume manager.
14. The method as set forth in claim 13, further comprising the steps of: generating a connection between a file system manager of a first storage node and a volume manager of a second storage node; and accessing a physical volume attached to said second storage node by said volume manager of said second storage node.
15. The method as set forth in claim 9, wherein said storage operation comprises a write operation.
16. The method as set forth in claim 9, wherein said storage operation comprises a read operation.
17. A computer readable medium for storing a plurality of instructions, which when executed by a computer system, causes the computer to perform the steps of: storing a plurality of files, accessible by a single file system, in a plurality of physical storage volumes; coupling a plurality of storage nodes to said physical storage volumes through said network; providing access to each one of said storage nodes to each of said physical storage volumes; receiving, at each of said storage nodes, a request from a client for a storage operation on a file identified in said file system; accessing at least one of said physical volumes in response to said storage operation; and transmitting a response to said storage operation to said client.
18. The computer readable medium as set forth in claim 17, further comprising the steps of: coupling a load balancing switch to said network; receiving said request from a client for a storage operation; and selecting one of said client nodes to process said storage operation.
19. The computer readable medium as set forth in claim 17, further comprising the step of coupling at least one of said physical storage volumes to each of said storage nodes.
20. The computer readable medium as set forth in claim 17, further comprising the step of accessing said physical storage volumes through a network.
21. The computer readable medium as set forth in claim 17, wherein: the step of receiving, at each of said storage nodes, a request from a client for a storage operation comprises the steps of: receiving a request at a file system manager; and processing said client request for said storage operation; the step of accessing at least one of said physical volumes in response to said storage operation comprises the step of accessing said physical volumes from a volume manager.
22. The computer readable medium as set forth in claim 21, further comprising the steps of: generating a connection between a file system manager of a first storage node and a volume manager of a second storage node; and accessing a physical volume attached to said second storage node by said volume manager of said second storage node.
23. The computer readable medium as set forth in claim 17, wherein said storage operation comprises a write operation.
24. The computer readable medium as set forth in claim 17, wherein said storage operation comprises a read operation.
PCT/US2003/033175 2002-10-17 2003-10-17 A distributed network attached storage system WO2004036408A2 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
AU2003301379A AU2003301379A1 (en) 2002-10-17 2003-10-17 A distributed network attached storage system

Applications Claiming Priority (8)

Application Number Priority Date Filing Date Title
US41977802P 2002-10-17 2002-10-17
US60/419,778 2002-10-17
US10/368,026 2003-02-13
US10/367,541 2003-02-13
US10/367,541 US7509645B2 (en) 2002-10-17 2003-02-13 Methods and apparatus for load balancing storage nodes in a distributed network attached storage system
US10/367,436 2003-02-13
US10/368,026 US7774325B2 (en) 2002-10-17 2003-02-13 Distributed network attached storage system
US10/367,436 US7774466B2 (en) 2002-10-17 2003-02-13 Methods and apparatus for load balancing storage nodes in a distributed storage area network system

Publications (2)

Publication Number Publication Date
WO2004036408A2 true WO2004036408A2 (en) 2004-04-29
WO2004036408A3 WO2004036408A3 (en) 2004-12-29

Family

ID=32111033

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2003/033175 WO2004036408A2 (en) 2002-10-17 2003-10-17 A distributed network attached storage system

Country Status (2)

Country Link
AU (1) AU2003301379A1 (en)
WO (1) WO2004036408A2 (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2001067707A2 (en) * 2000-03-03 2001-09-13 Scale Eight, Inc. A network storage system
US20020083120A1 (en) * 2000-12-22 2002-06-27 Soltis Steven R. Storage area network file system
US20020133539A1 (en) * 2001-03-14 2002-09-19 Imation Corp. Dynamic logical storage volumes

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7774325B2 (en) 2002-10-17 2010-08-10 Intel Corporation Distributed network attached storage system
US7774466B2 (en) 2002-10-17 2010-08-10 Intel Corporation Methods and apparatus for load balancing storage nodes in a distributed storage area network system
EP1643393A3 (en) * 2004-10-01 2007-01-17 Microsoft Corporation System and method for managing access to files in a distributed file system
US7584220B2 (en) 2004-10-01 2009-09-01 Microsoft Corporation System and method for determining target failback and target priority for a distributed file system
CN1322427C (en) * 2005-02-25 2007-06-20 清华大学 Universal method for dynamical management of storage resource under Windows platform
US8037058B2 (en) 2009-04-09 2011-10-11 Oracle International Corporation Reducing access time for data in file systems when seek requests are received ahead of access requests
WO2013162954A1 (en) * 2012-04-27 2013-10-31 Netapp, Inc. Efficient data object storage and retrieval
US8793466B2 (en) 2012-04-27 2014-07-29 Netapp, Inc. Efficient data object storage and retrieval
US10523753B2 (en) 2014-05-06 2019-12-31 Western Digital Technologies, Inc. Broadcast data operations in distributed file systems
CN109783678A (en) * 2018-12-29 2019-05-21 深圳云天励飞技术有限公司 A kind of method and device of picture search
CN109783678B (en) * 2018-12-29 2021-07-20 深圳云天励飞技术有限公司 Image searching method and device

Also Published As

Publication number Publication date
WO2004036408A3 (en) 2004-12-29
AU2003301379A8 (en) 2004-05-04
AU2003301379A1 (en) 2004-05-04

Similar Documents

Publication Publication Date Title
US7509645B2 (en) Methods and apparatus for load balancing storage nodes in a distributed network attached storage system
US10459649B2 (en) Host side deduplication
US11880578B2 (en) Composite aggregate architecture
JP5026283B2 (en) Collaborative shared storage architecture
US7739250B1 (en) System and method for managing file data during consistency points
US7849274B2 (en) System and method for zero copy block protocol write operations
US7389382B2 (en) ISCSI block cache and synchronization technique for WAN edge device
US7747836B2 (en) Integrated storage virtualization and switch system
US7904649B2 (en) System and method for restriping data across a plurality of volumes
US20210342082A1 (en) Asynchronous semi-inline deduplication
WO2002050714A2 (en) A data storage system including a file system for managing multiple volumes
US8554867B1 (en) Efficient data access in clustered storage system
US20050193021A1 (en) Method and apparatus for unified storage of data for storage area network systems and network attached storage systems
WO2004036408A2 (en) A distributed network attached storage system
CN111868704B (en) Method for accelerating access to storage medium and apparatus therefor
US8996802B1 (en) Method and apparatus for determining disk array enclosure serial number using SAN topology information in storage area network
Wei et al. DWC 2: A dynamic weight-based cooperative caching scheme for object-based storage cluster
Wang et al. A general-purpose, intelligent RAID-based object storage device

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A2

Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BY BZ CA CH CN CR CU CZ DE DK DM DZ EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NO NZ PL PT RO RU SD SE SG SK SL TJ TM TR TT TZ UA UG UZ VN YU ZA ZW

AL Designated countries for regional patents

Kind code of ref document: A2

Designated state(s): GH GM KE LS MW MZ SD SL SZ TZ UG ZM ZW AM AZ BY KG KZ MD RU TJ TM AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HU IE IT LU MC NL PT RO SE SI SK TR BF BJ CF CG CI CM GA GN GQ GW ML MR NE SN TD TG

121 Ep: the epo has been informed by wipo that ep was designated in this application
DFPE Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed before 20040101)
WWE Wipo information: entry into national phase

Ref document number: 20038A62397

Country of ref document: CN

122 Ep: pct application non-entry in european phase
NENP Non-entry into the national phase

Ref country code: JP

WWW Wipo information: withdrawn in national office

Country of ref document: JP